From YouTube: NUG Monthly Meeting, May 20, 2021
B
Hi, this is Koichi from PNNL. I just put in the chat that we got the paper accepted and published; the final form actually took some time. This paper used NERSC resources to, you know, develop an algorithm to track very strong thunderstorms that produce a lot of precipitation and damage a lot of infrastructure. We also used Cori to run the computer models, the climate models, for a bunch of simulations, and we used NERSC resources to develop and use the machine learning.
A
And thank you for that. That's actually a good point: when you submit papers where you're using NERSC resources, it's really helpful to us if you include an acknowledgement. Somewhere on our web page, under www.nersc.gov, there's a kind of format that you can use. This helps in our argument for, you know, funding from the DOE, basically.
B
Yeah, yeah, I think we copied that statement from the website. Yes.
A
Yeah, so thanks for doing that; that's really helpful. So machine learning for atmospheric work is kind of an interesting field. Were you using the machine learning to sort of speed up the solver, or to look at the results of the, you know...
B
For this one it's more traditional, actually. We used the self-organizing-map type of machine learning to really, you know, find out particular high-dimensional structures of the atmosphere: you just compress all the different variables at different height levels, which is just difficult for us to do, and these are very non-linear processes, so machine learning is really nice for teasing out those patterns that are really hidden in the atmosphere. And the DOE is really pushing to use machine learning in our field as well. So this one is the more typical, traditional way.
B
But, you know, a bunch of us are also working on using machine learning to develop the predictive part of the model, particularly for those processes that the global model cannot resolve, like turbulence and convection. And probably somewhere down in the future I'd like to post another work that some of us are working on: developing a cloud model using machine learning to study cloud and aerosol interactions, and using those machine learning methods, or trained algorithms, as part of the global models.
A
Yes, and I understand that thunderstorms in atmospheric models are quite difficult to detect, normally, because they tend to be quite high resolution.
B
Yeah, it's very difficult, and traditionally it could be very subjective; we do not really strictly agree on what constitutes, in our case, these particular strong storms, what we call mesoscale convective systems. But the first part of this paper really tries to make it as objective as possible and as realistic as possible. So it's in three stages: you know, that tracking algorithm; then the model simulations, using a new global model with a variable-resolution grid; and then finally applying the machine learning algorithm to both observations and models, to compare reality and the model more objectively.
A
Great work. Anybody else got a win or a success that they'd like to share?
A
So we can go to the flip side of the coin here, which is "Today I Learned". You know, what we do here is research, and part of the nature of research is that you get stuck on things, you hit dead ends, you try new things and they don't work, you climb over quite challenging learning curves.
A
It can be a little painful and frustrating, but it's not a bad thing, because this is kind of how we learn stuff, and how we get new knowledge and new discoveries out into the community. So this section is kind of an opportunity to talk about something that bit you, something that tripped you up, or even just something interesting that you stumbled across.
A
And that directory is very difficult to find; well, "difficult" is the wrong word, but it's not obvious, because if you ls -la in your home directory it's not visible there. It's almost like you need to know the secret code: you have to ls -a .snapshots, and the system will show you what's inside that directory. And if I remember rightly (I think Larry looked at this more recently than me and can probably correct me) the snapshot gets taken once a day.
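A minimal sketch of that, assuming the daily snapshots sit in date-stamped subdirectories (the names here are illustrative):

    ls -la $HOME                # .snapshots does not show up here
    ls -a $HOME/.snapshots      # ...but naming it directly works
    # restore yesterday's copy of a file from the snapshot
    cp $HOME/.snapshots/2021-05-19/myfile.txt $HOME/myfile.txt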
A
So, for me, most of my hard lessons over the last week or month have been around tips and tricks for using something called Spack, which you may have heard of, and which we actually have on our system: you can module load spack. The default one at the moment is a fairly old version, 0.14.2; I think it's over a year old. Spack is being rapidly and actively developed, so things change pretty quickly.
A
It's quite neat: you can describe, using a DSL, what you would like. So you can say, yeah, I would like to install the package slate at this version, with this compiler, and the setup that we've got will install it by default into a directory in your home directory called sw, for software. In the module file for 0.16.1 you can actually change that; there's a variable called something like spack preferred base.
A
Running it is, you know, just a single command line, really a couple: one to check what it's going to do, and then one to tell it to go ahead and do it, which is really nice. But of course software is complex and things don't always work, and when it breaks it can be a little challenging to find out why. So I've been learning.
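The check-then-install flow just described might look like this (the package and compiler are illustrative):

    module load spack
    spack spec zlib %gcc      # check what it's going to do (show the concretized build)
    spack install zlib %gcc   # then tell it to go ahead and do it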
A
There's a whole lot of information there. We have some stuff about it on our web pages as well.
A
Under... yeah, don't ask.
A
But yeah, give it a whirl. You might find that it works really well; you might find that things are complicated. But, you know, you can also drop us a line and send us a ticket to ask for assistance with it.
A
I see a couple of other things wrapped up in the chat. Oh, hey (am I pronouncing that correctly?): "the maximum time you could run on a single GPU node is four hours."
A
This is on Cori GPU. Do you want to tell us about what you learned?
D
Oh yeah, I was just trying to run some hyperparameter search on my neural network. I first tried to use JupyterHub, and I found that my job gets constantly killed after a certain period of time (I didn't really count how long that is), and then I switched to the Cori GPU nodes.
D
So after I request that GPU, I think I can get a stable running period of time, but that GPU only lasts for like four hours. So I was just wondering if there's a way to request the GPU for a longer time, because, you know, hyperparameter search quality kind of scales with the number of attempts you make: the longer you search, the better the model you will get for your network. Some people in the chat have already given me some help, so I think that's very helpful.
A
Yeah, so there are some good tips there, actually, about being able to tweak it, or find the constraints and get running overnight. So that's good to know.
A
So we are getting up to 25 past 11, but we've got a few more minutes. Does anybody else have a tip or trick, or something that they would like to learn because it's a sticking point?
A
All right, so for our next section we have a space for announcements and calls for participation. There's kind of a lot going on at the moment; if you scanned through the weekly email this week, you'll have seen the emails that went out.
A
We have a NERSC-plus-NVIDIA GPU hackathon coming up. The original deadline for submissions, I think, was yesterday, but it's been extended, so you have until the end of the week. So if you have a code that you're working on to get GPU-ready, and you would like some assistance from experts from both NERSC and NVIDIA, that's still available; the web address here is in the slides.
A
There are a few training events coming up. There's an Intro to NERSC coming up on June 3, so that's in a couple of weeks. Some of you might have seen Parallelware, which was a topic of the day a couple of months ago now; Appentra is holding office hours for assistance getting up and going with Parallelware on June 9,
A
and I think there's a link in the weekly email to where you can, what would you call it, make an appointment for those. Another training event coming up at NERSC is a crash course in supercomputing. And another one that I'll give you a heads-up on, one that hasn't actually been announced yet but will be very soon, is that we'll be doing some training on using Lmod. Some of you may already be familiar with Lmod.
A
It gets used at a number of other sites, but on Cori the modules environment that we use is TCL modules; it's kind of the original one. Lmod is a little more than a re-implementation of it, but it's a kind of follow-on from it: the same sorts of ideas, with, you know, a newer scripting language behind it, and a few sort of updates.
A
I think that previously you needed to request access to it; it's now available to all. I can see there's a bit of chat going on in the Zoom chat; let's run through that, so that everybody has seen it.
A
So there's some discussion around some of the training; there was talk about doing checkpoint/restart as a way of stopping and starting things. We just recently had a checkpoint/restart training session on MPI-agnostic, network-agnostic checkpoint/restart, and I think we have some notes on that in the documentation pages; it's probably under "Running jobs", here, "Checkpoint/restart". So yeah, particularly for your serial and some MPI jobs: if the code doesn't have its own built-in checkpointing, DMTCP can be quite handy.
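A minimal DMTCP sketch for a serial job, assuming a dmtcp module is available (the interval and executable name are illustrative):

    module load dmtcp
    dmtcp_launch -i 3600 ./my_app    # write a checkpoint every hour
    # in a follow-up job, resume from the checkpoint files:
    dmtcp_restart ckpt_*.dmtcp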
A
It's also helpful to know that Lmod supports TCL-format module files, yes. And on Perlmutter we intend to be using Lmod, so yeah, there will be a little bit of transition. It should be fairly smooth, in that most things work in exactly the same way; there are a couple of slight differences. But, very usefully, particularly if you have your own module files, Lmod understands TCL module files as well. There's a question here: compiling for Haswell, not the KNL?
E
I think that's a good question; that's a good point to, you know, bring up, since some codes for KNL do need to be compiled on a KNL node. I think we could discuss internally whether we want to add a KNL node to the compile QOS.
A
That's a good point, actually. And there's a question coming in about the debug queue time limit, because KNL compilation, especially for a large code, can take a while.
A
So, yes, it is possible to cross-compile for KNL from a login node. Something that I've discovered (and probably several others of us here have) is that the part that cross-compiling most often trips up on is if you're using CMake or ./configure: they build and run little executables to see, you know, if things are available or if things work, and unless the package has been very well developed, quite often it will try to, you know, build and run some executable to test something. And of course, because you're cross-compiling for the KNL, it doesn't work on the login node, because it's built using AVX instruction sets. So for that, sometimes just going to a KNL node (you know, getting an interactive node) just for the ./configure step can help to sort of get through that part, and then you can go back to the login node to do the actual compiling, which can be a lot faster than doing the full compile on the KNL node.
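Sketched out, that workflow is roughly the following (the queue name and times are illustrative):

    salloc -N 1 -C knl -q interactive -t 30:00   # grab a KNL node
    ./configure ...    # configure's little test executables can run here
    exit
    make -j 16         # back on the login node: the faster cross-compile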
A
So those are the announcements that I know about. Does anybody else have any announcements or CFPs that would be good for NERSC users to know about?
A
It seems like more work than is necessary; oh, just because it's going through ServiceNow. But there's a link on that page for requests, and one of the requests is access to Cori GPU nodes.
A
Clicking on these in the slides isn't helpful; it jumps too far forward. All right, we can post a link later in the chat, or, if you have the slides open in front of you, you'll be able to click on the link there.
A
So we have a few people with us who are part of the NESAP program, who have some Cori GPU scripts to share and give us a little bit of a walkthrough.
C
Okay, so this is a quick example; it's going to demonstrate how to run a Jupyter notebook. I'll show you: this is my job script for Cori GPU.
C
So I'm doing an sbatch to cgpu, I'm asking for four GPUs here, and I'm emailing myself to let me know how the job is going. And for anybody who uses Python and wants to source a custom conda environment, this is how you do it inside your script: I load the Python module and I source this environment called papermill, which I've already built. Papermill is a library that allows you to run Jupyter notebooks from the command line and also to insert overriding parameter cells. So papermill is really cool.
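For reference, the papermill command line looks roughly like this (the notebook names and parameter are hypothetical):

    papermill input.ipynb output.ipynb -p n_trials 20   # override a tagged parameter cell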
C
We have some docs about that. And here's where I'm launching it: I have an srun where I ask for my GPUs, and then I have this script called run papermill. So that's it here.
C
I import the papermill library and do some arg-parsing and stuff, but the powerful part is here. This is where I'm actually launching my Jupyter notebook from the batch script, so I don't actually have to log into Jupyter, and I don't have to do anything interactively, which is cool. So I'm launching this notebook (it's flexible), and it's saving the output on CFS, or pretty much wherever you want. And then the thing that my notebook is actually doing is spinning up a Dask cluster, and I'm not...
C
I can't explain everything here, but Dask is kind of a Python parallel, task-based library. So I check that I have a GPU, I spin up a Dask cluster, and then I'm doing some cuDF processing, so that's kind of pandas, but on a GPU. And then, when I'm done, I write some output files and shut my cluster down. So in this one job script I'm able to start and run a bunch of Jupyter notebooks. So I think it's pretty powerful.
A
That is really neat, yeah. I think I'll echo William's question.
C
Yeah, William, so I see a question about sharing the script. At the moment this is not quite ready to share, but I'm going to put up a public version of this stuff in the next week or two, because it's going to be part of a paper at SciPy; there are paths in here that we don't want users to see. But yes.
A
Sounds really good. So the workflow for developing that would then be, I guess, that you begin by requesting an interactive GPU node through jupyter.nersc.gov, do your kind of development with papermill there, you know, for the GPU side of it, and then go on to develop the shell script separately?
C
Yeah, actually it's kind of backwards. I would start by logging into Jupyter and, you know, developing some script interactively so it does what I need, and then, when I have what I want, I can wrap it in papermill, and papermill will override certain cells for me so I can do a parameter scan, and then I can put that in my batch script. But there are other solutions too; there's jupytext, and I think there are other options, but I like papermill, it's easy to use.
F
Yeah, okay, so this is essentially AMReX's sample run script, and I can throw this one in the chat for you guys to use as a reference. It is our reference.
F
It is a bit outdated in some places, but it's outdated in a way that we actually depend on. If you look at the example scripts in the docs, here's what you normally get, and you see that most use this gpus-per-task to define that, for a given rank, you're going to have so many GPUs. The only difference in what we do compared to this is that we loosen that: AMReX, inside of it, looks at all visible GPUs and parses them out inside the code at initialization.
F
So it doesn't do this up front and limit what it can see. Right now that's really for testing, and we don't really use it; most of the time we have one task per GPU, like most people, but we don't want to limit that right now, because we have various testing and, you know, experimental things that we're doing.
F
That's really the only difference from it overall. But this script actually has all the kinds of things that you would need to do to run on Cori GPU: the gpu constraint, time limit, job account (we used m1759; change it to yours), and then all the tricky bits, which we've actually got a note down here about.
F
What you probably want to do is change this to gpus-per-task equals one for most of your codes, but otherwise this works nicely. And then we have a couple of examples here: for one node of Cori GPU you would set it up like this; for two, you'd set it up like this. So notice: CPUs are per task, GPUs are per node, and tasks are per node, so none of these numbers change; you just change big N and little n. We then have some Slurm commands.
F
So this is how you get on interactively. If you do a single GPU, the line would look like this: just getting a single GPU, and 10 is the even distribution of CPU threads on one node. And then a full single node using the --exclusive flag, or multi-node using the --exclusive flag, if you wanted to. And this is where we set it up: you put in your executable and your input, and we run like this, an executable and then an input file.
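A rough sketch of the kind of header being described, for one Cori GPU node (the values are illustrative, and m1759 was the presenters' account):

    #!/bin/bash
    #SBATCH -C gpu
    #SBATCH -N 1                # big N: nodes
    #SBATCH -n 8                # little n: total tasks (8 per node here)
    #SBATCH -c 10               # CPUs are per task
    #SBATCH --gpus-per-node=8   # GPUs are per node
    #SBATCH -t 60
    #SBATCH -A m1759            # change the account to yours

    srun ./main3d.ex inputs     # an executable and then an input file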
F
So this is where we defined it in our code, so you don't have to tweak much. And then there are two launches here: one for if you're running it in sbatch, so you just launch like this; and if you're in an interactive node, you'd want to put your configuration in here, so then the srun would look like this. And we have this also here, to compare and look at.
F
So here is Nsight Systems profiling, and this is probably the one that's most useful. It just does a basic profile of the run. So there's your exe and your executable; we output it based on job ID numbers, to get a unique ID every time. That's all it is, output to this file, and this will give you a timeline.
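The profiling line is along these lines (the executable and output names are illustrative):

    # basic timeline profile, keyed to the job ID for a unique name
    srun -n 1 nsys profile -o profile_${SLURM_JOB_ID} ./main3d.ex inputs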
F
So this is if you're looking into profiling, and how to run these things with profiling. Otherwise, we have all the basics up here about how to configure your system. The only real trick that I would say to consider when doing this is: always grab the entire node. Don't try to piece out a node or get a part of it; if you do that, you start running into NUMA problems.
F
How the CPU and the GPUs are laid out might not be exactly what you're expecting, and so you'll get a different configuration or it will run a little differently. So one thing that we usually do, unless we're running in this exclusive mode or in this interactive mode getting one GPU, is we get the entire node: we ask for all the resources in the entire node and pick out the parts that we actually need for the run.
F
That way you're sure you get the same NUMA configuration, you get the same layout of your resources, and so you get consistent results. Otherwise, here it is; feel free to, you know, copy this one over and use it and borrow it and do whatever you need to. And watch out for that gpus-per-task flag, which is the one difference that you might want to account for if your code doesn't pre-configure your GPUs like AMReX does.
A
So yeah, that's a very interesting note about how the code works, and gres gpu versus gpus-per-task. So, in summary, then: for most codes, I guess, they offload to a GPU, and normally one GPU per task is the most common assumption.
A
Right, yep. And so for most people they would use gpus-per-task equals one, but in the case of AMReX, it does sort of careful management of the GPUs itself (yes), and so, basically, with gres you're giving AMReX control over how it distributes the GPUs; you're just telling Slurm, "give me all the GPUs and let me figure it out."
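In sbatch terms, the two styles contrast roughly like this:

    #SBATCH --gpus-per-task=1   # typical: Slurm hands each rank its own GPU
    #SBATCH --gres=gpu:8        # AMReX-style: all GPUs visible; the code divides them up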
A
That sounds good. And the little bonus there of how to run Nsight to get a profile of your code is a very nice addition as well.
F
So what we have built in is, what are they called, the TinyProfiler wrappers, which sort of define locations in it, and even with that in there, which adds a little bit of extra overhead, the Nsight Systems profiler doesn't add a ton of overhead; maybe 10, 20, 30 percent.
F
However, Nsight Compute can add a ton of overhead when you individually go after a single kernel; that can add a whole lot of overhead, especially if it's a big, thick, nasty kernel. So Nsight Systems we can run fairly regularly without too big of a problem. Now, when you profile, you always want to do a smaller subsample; you don't want to do everything. If you do, you know, a full production run and try to sample it, you will see overhead, there's no doubt, but for small little test cases it's worth testing.
F
Robert, most generically, that's exactly what I mean. Yes, we grab the device count, set them up, and then parse them out, and most of the time we do that, but there are a couple of cases where we want to do something more fancy, like give multiple GPUs to a single rank, or a weird subset, and we want to have the flexibility to be able to tweak that inside the codes. Yes, that's what we do differently, correct.
A
So a normal code, then, I guess, that just calls cudaGetDeviceCount and cudaSetDevice, they can still use gpus-per-task equals something; is that correct? Yeah.
A
And Andrew has a comment about Ascent's job step viewer. Ascent is the Summit test system; yes, it's the Summit test system.
F
That's actually strictly a Summit tool, that job step viewer, and we don't have one. We have a job script generator, and I'm not sure what the plan is for Perlmutter, whether someone has already put it on the docket to upgrade it for Perlmutter and all that kind of stuff. I'm not sure if it actually covers Cori GPU either; I don't believe it does.
A
Thanks, Kevin. And Jody, were you able to access your script?
G
Yeah, yeah, I'll share my screen in a second. All right, can you see my screen? Yes, that's working. Do you see the terminal?
A
Yeah, you're on dtn03.
G
Yeah, that's right. Okay, so let me just show you my job submission script for training the neural network on a few GPUs on a single node.
G
If you would like to look at a script that does multi-node training, then I can point you to a tutorial prepared by Steve and Mustafa, but I'll just go through this script, which is mostly just normal stuff. You know, it requests a node, and then this one is requesting four GPUs. I am using the NERSC PyTorch NGC image over here to train, so I'm just specifying which Shifter image should be used. This part here is essentially copying a bunch of data; as you can see, it's copying some data to the NVMe, the solid-state drive, on the GPU node.
G
And this part over here is telling the code which Python environment to use, and finally, line 29 is launching the PyTorch distributed training job. Yeah, let me see.
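A skeleton of that kind of Shifter-based submission (the image tag and script name are placeholders, not the exact ones on screen):

    #!/bin/bash
    #SBATCH -C gpu
    #SBATCH -N 1
    #SBATCH --gpus-per-node=4
    #SBATCH --image=nersc/pytorch:some-ngc-tag   # placeholder image tag

    srun shifter python train.py                 # the distributed training launch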
G
I think there are eight GPUs on a node, but, I mean, yeah, the only thing is: if you use, let's say, four GPUs and I submit two different jobs, and they somehow end up on the same node, then the GPU might not have enough memory to run both jobs. So you have to be careful when you're submitting but using only part of the node, or only a few GPUs on a single node.
A
Right, so that's a good tip: you need to remember the amount of memory that each GPU has when setting up the job.
G
Yeah, and I can also paste a link to the SC tutorial that Mustafa, Steve, and, I think, Josh have developed, which should allow you to create a script that uses multiple GPUs and either Horovod or DDP.
G
Yeah, I think you can look at what Shifter images are available in the NERSC repository using grep or something, and you can also build your own Shifter image and push it.
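For example (the image names are placeholders):

    shifterimg images | grep pytorch              # see what's already available
    shifterimg pull docker:myuser/myimage:latest  # pull in your own image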
A
Do you happen to know, when the Shifter image was being built, if anything special had to be done for PyTorch to use the NERSC GPUs?
A
That's all I have; thank you very much. So we've only actually got about five minutes left in the meeting, but the last couple of things don't usually take very long, so I think that's probably a good time for a bit of Q&A and story-swapping.
A
And if not (I see there's been quite a lot of questions and answers and discussion happening in the chat), if anybody has any questions about setting up GPU scripts that they'd like to ask either our panel or the community generally, please unmute and speak.
H
I'd like to add that the default for Cori GPU is non-exclusive access, that is, shared node access, so if you need exclusive access, you need to add, is it, dash-dash exclusive or something at the top of the script? Can somebody...
H
Is it double-dash exclusive, as an sbatch option? I forgot the actual syntax, since it's on Cori scratch.
A
Here we go, we have a bit of discussion in the chat: it said double-dash exclusive, and I think that is an sbatch option. So, yes, that's a good tip: the GPUs are not exclusive by default. So, if you do need exclusive use of a GPU (and you might find that for a lot of things you don't), or exclusive use of a node.
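The flag itself is just an sbatch option:

    #SBATCH --exclusive   # whole node, instead of the shared default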
A
Rather, I think you'll probably find with GPUs that there's a sufficient amount of power in each GPU that, you know, you only need a small portion of the resources on a node for single-GPU-type jobs, or smaller jobs.
A
You don't want to require all 8 GPUs when you're trying to...
A
Okay, so let me share my screen again. In the meantime (you've probably already seen this, but if not, or even if you have, it's a good reminder), the Cori GPU nodes have their own docs webpage, at docs-dev.nersc.gov, and amongst the various information here there's actually kind of a diagram of what the node layout looks like. So you can see there are, you know, two CPU sockets, with four GPUs attached to each CPU socket, and NVLink across them all.
A
Let us know: drop us a line, either something in the webinars channel, or you can direct-message me in Slack, or send us a ticket.
A
It would be great to hear from people. And a quick look over last month's numbers before we wind up. So, overall availability: we actually took a few hits in April; we had a few outages, unfortunately. There was, of course, the regular scheduled monthly maintenance, but there were a couple of issues that hit, some of them external.
A
So we had an electrical issue that took out sort of a couple of cabinets, with some knock-on effects, and that was electrical-related, but this one over here was actually a hardware failure in the cabinet. So we did take a few knocks during April.
A
That said, HPSS and CFS continued to have very good availability. Cori utilization was very high (we're up at 97 percent), and large jobs were comfortably above our target: we have a target of 25 percent of Cori's workload being things that need something of Cori-type scale, and in April we had a little over 30 percent.
A
We've been sitting at relatively high numbers for a little while there now. Tickets coming in and closed: at the beginning of May we had a backlog of about 400, and a little less than 500 new tickets. You might have noticed a trend here over the last few months; it's pretty normal to see in the range of five or six hundred new tickets a month coming in.
A
And that's all we have for today. Thank you again, everyone, for participating, and especially Kevin and Laurie and Jody for walking us through your scripts.
A
Thank you all again. I'll stop the recording now, and we'll look forward to seeing you at the Perlmutter dedication and, yeah, at our next meeting.
A
Yes, absolutely; we're chatting in the webinars channel, but also, for general questions, the general channel is good too.