From YouTube: Building and Running GPU Applications on Perlmutter
Description
Part of the Using Perlmutter Training, Jan 5-7, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/using-perlmutter-training-jan2022/
So we're going to walk through some examples of building and running GPU applications on Perlmutter.

So in this session we'll look at building and running an application on Perlmutter with MPI and GPUs, using CUDA as an example. We'll then have a little bit of a break, which should land at around lunchtime for those in the Eastern time zone in the US, and then we'll go into session two, which will be a slightly longer, also hands-on-oriented session, where we'll walk through a few additional scenarios, such as a little bit about some of the math libraries and using other compilers rather than only NVIDIA's.
So, first up, when you are compiling code: you'll have your normal program source code, a bunch of C, C++, or Fortran 90 CPU source code files. These can have MPI calls within them. They may use directives for using the GPU, such as OpenACC or OpenMP. Those you'll compile with the regular compilers, and more specifically, on Perlmutter and NERSC Cray systems generally, you'll be using the Cray compiler wrappers, which give you the MPI stack and some other niceties built in.
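For instance, a minimal sketch of what that looks like in practice (the source file names here are just hypothetical):

    CC  -o my_app    my_app.cpp     # C++ wrapper; MPI and other Cray libraries come in automatically
    cc  -o my_c_app  my_c_app.c     # C wrapper
    ftn -o my_f_app  my_f_app.f90   # Fortran wrapper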
This is all stuff that NERSC users who have been using Cori for a while will already be familiar with.

Then we have CUDA code, which comes in .cu files, and those you'll compile with nvcc, which is part of the NVIDIA CUDA stack. Just a kind of a tip, though: with PrgEnv-nvidia, which uses the NVIDIA CPU-side compilers, those compilers can actually read CUDA code incorporated into the same source files. To enable that, you can add the -cuda or -gpu flag at compile time, and we'll actually see that in the examples.
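Roughly, a sketch of the two routes just described (the file names are hypothetical):

    nvcc -arch=sm_80 -c kernel.cu                  # plain CUDA in a .cu file, compiled with nvcc
    CC -cuda -gpu=cc80 -o app main_with_cuda.cpp   # PrgEnv-nvidia: CUDA code embedded in a C++ source file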
So, looking at the software stacks that we'll be working with: on the left-hand side here, by default we have the PrgEnv-nvidia software stack loaded. This gives you the NVIDIA compilers, plus the Cray compiler wrappers, a few useful underlying Cray libraries such as cray-libsci, and the MPI stack.

We recommend generally, if possible, using the provided Cray MPICH rather than building your own Open MPI or whatever, because that's the one that's best optimized for our high-speed network, and it's also part of the Cray PE magic that makes the compiler wrappers do a lot of things automatically without you needing to put extraneous options in there.
So this should be fairly familiar if you've already used Cori. What's new on Perlmutter, but might be familiar if you've used GPU and CUDA applications on other systems, is the CUDA stack here, which you can get with module load cudatoolkit. That gets you nvcc, the NVIDIA CUDA compiler we talked about before, plus a bunch of libraries and tools which are all part of the NVIDIA CUDA Toolkit and are needed for GPU code. So you'll need to module load cudatoolkit when you're building things for GPUs.

So, what to actually load: for most applications, including the examples today, we recommend that you use the PrgEnv-nvidia stack, and this one is loaded by default when you log into Perlmutter, so unless you're changing something, it should already be there. To build GPU applications, which is going to be the case for Phase 1, you'll also need to load a cudatoolkit module. There are a whole slew of cudatoolkit modules available on the system that match with different versions of CUDA and different versions of the compiler, particularly the NVIDIA compiler.
The default one is generally the best one to use right off the bat, unless it doesn't work. Pretty much what you want to do is choose one that has a CUDA version that matches what your application needs, and the default one is currently the latest CUDA version available on Perlmutter, so 11.4, I think. If you're doing OpenMP or OpenACC offloading, or also using CUDA-aware MPI, which we'll cover in a little bit, you'll also need to load one of the craype-accel modules, and in particular we want craype-accel-nvidia80.
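As a minimal sketch, the module setup being described is roughly:

    module load cudatoolkit             # default version; pick a specific one if your application needs it
    module load craype-accel-nvidia80   # needed for OpenMP/OpenACC offload and CUDA-aware MPI
    # PrgEnv-nvidia itself is already loaded by default at login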
You'll see this number 80 pop up a bit. This is because the, I guess, architecture version number of our GPUs is sm_80, so that 80 is the Ampere series. If you used Cori GPU, you will have seen sm_70 come up a bit; that was the Volta generation that we had before.

All right, so let's get kind of straight into it. Hopefully people have got a session open so that they can log in to Perlmutter and run from there.
A
Git
clone
this,
unfortunately
rather
long
url,
but
what
you
can
do
is,
if
you
type
module,
show
training,
it
will
print
up
a
little
help
test
that
includes
this
url,
so
you
can
copy
and
paste
it
within
there
go
to
the
the
directory
called
cuda.
Slash
ex3
will
jump
straight.
To
example,
three
don't
forget
to
module
load,
cuda
toolkit
run,
make
and
take
a
look
at
the
output.
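So, roughly, the steps look like this (the clone URL is the long one printed by module show training, so it isn't repeated here):

    module show training               # prints the help text, including the git URL
    git clone <URL-from-the-help-text>
    cd <cloned-repo>/cuda/ex3
    module load cudatoolkit
    make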
A
Here
we
go
so
what
we'll
do
to
try
to
basically
identify
when
people
are
ready
to
move
on
if
everybody
can
use
your
zoom
session
to
raise
your
hand
to
begin
with,
and
then
when
you've
made
it
through
the
exercise
lower
your
hand
and
we'll
kind
of
watch
for
hands
that
are
raised
and
enhanced
lowered
to
see
kind
of
where
people
are
up
to.
I
will
also
create
a
breakout
room.
A
I'll
be
helping
we'll
call
it.
I
hope
if
you
have
issues
having
accounts
issues,
answers
to
promoter
issues.
A
Okay,
so
hopefully
you
should
see
now
in
your
zoom
controls,
an
option
to
jump
to
a
breakout
room
and
there's
a
breakout
room
called
help
with
connections.
I
think,
is
what
I
called
it
and
we'll
have
one
or
two
nest:
people
in
there.
A
So
if
you're
having
trouble
not
so
much
with
the
exercise,
not
perhaps
for
the
exercise,
but
but
particularly
with
just
getting
onto
pearlmatter,
if
there's
something
wrong
with
your
account,
please
use
the
breakout
room
and
then
we
can
sort
of
you
know
separate
those
challenges
from
challenges
of
using
the
exercise
itself.
A
So
I
can
see
I
can
see
quite
a
few
hands
raised,
which
is
a
good
sign.
People
are
paying
attention
and
raising
hands
and-
and
some
may
have
already
finished
it's
going
up
and
down.
We
might
make
it
sort
of
about
five
minutes
instead
of
ten
minutes
to
run
through
this.
It
should
be
a
reasonably
simple
exercise.
Hopefully.
A
I
see
a
question
in
the
chat
from
from
william
about.
Should
we
clean
it
to
hove
or
p
scratch,
you
can
clean
it
to
either
really.
Actually
it's
a
reasonably
small
set
of
examples.
A
So
for
issues
with
compilation,
if
it's
a,
if
it's
a
straightforward
question,
I
think
the
google
doc
is
going
to
be
the
way
to
go.
A
A
Hopefully,
somebody
from
nurse
will
slap
me
if
there's
a
any
questions
that
I
should
answer.
What does this tell us? All right, so up here we've got CC: the makefile is calling the C++ Cray compiler wrapper. It's all a single CUDA file in this example. The Cray wrapper is calling the NVIDIA C++ compiler underneath, and that accepts an option -gpu=cc80, where cc80 is the tag for the architecture that corresponds to our GPUs, and it's creating an executable called vec_add.
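So the compile line in the makefile is roughly of this shape; this is only a sketch, and the exact source file name (and whether -cuda appears explicitly) may differ in the repo:

    CC -gpu=cc80 -cuda -o vec_add <source-file>   # Cray C++ wrapper calling nvc++ underneath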
A
If
you
did
have
troubles
with
that,
there
is
a
already
made
executable.
Actually,
that
is
in
the
examples
directory.
That's
pointed
out
by
that
module
show
training.
A
So
you
may
have
already
poked
around
and
discovered
these
there's
a
few
more
exercises
in
that
same
directory
to
walk
through
kind
of
at
your
leisure.
We
we
looked
at
exercise
three,
which
is
mpi
plus
gpu,
exercise
one
and
exercise
two,
a
simple
gpu
kernel
only
without
mpi,
particularly
if
you
found
difficulties,
building
the
mpi
one.
A
These
might
be
a
good
place
to
start
to
solve,
solve
problems
one
at
a
time
and
the
readme
one
level
up
should
have
a
bunch
of
information,
but
we
might
have
to
check
that
somebody
did
comment
that
a
readme.md
file
they
found
was
empty.
A
Okay,
step,
then,
is
once
we
build.
It
is
to
run
it
important
things
to
remember.
This
is
a
hpc
cluster,
don't
run
on
the
login
nodes,
submit
a
batch
job.
So
in
this
case
it's
a
very
short
job,
so
so
it
might
not
be
as
critical,
but
for
any
real
work.
You
definitely
want
to
be
submitting
a
batch
job.
Also,
when
you're
in
the
batch
environment,
you've
got
the
full
yeah,
slingshot
mpi
stack,
you
know,
I
think
on
falmouth
is
going
to
be
a
little
bit
easier
than
it
was
on
corey.
A
Other
important
thing
to
remember
is
when
you're
submitting
a
job
on
perlmutter.
You
must
specify
a
gpu-enabled
account
name.
A
A
So
when
you
submit
a
job,
you'll
need
to
especially
a
and
get
that
account
when
you're
doing
your
your
kind
of
real
work
later
you'll
use
your
own
project
account
for
that.
So there are a bunch of necessary sbatch options, and with GPUs there are a couple more now than what you were familiar with on Cori. The first bunch are pretty much the same: you'll need -q, which is the QOS, and for almost everything you'll want to use the regular QOS.

You want to set a time limit; if you give it just a number, that's the time limit in minutes for Slurm. So in this example here we're saying that after five minutes Slurm is allowed to kill this job. For a real job you'll probably want a few hours; finding the right time limit is very application dependent, and it is worth experimenting to get the right number. Dash lowercase n specifies the number of MPI tasks.
A
This
is
a
lowercase,
so
there's
a
number
of
mpi
tasks
as
opposed
to
uppercase,
which
would
be
the
number
of
nodes
and
we'll
come
back
around
to
that
on
the
next
slide.
When
we're
talking
about
those
splitting
work
over
gpus
as
well
as
nodes
dash
c
sets,
the
number
of
cpus
per
task
slerm
considers
a
cpu
to
be
what
linux
considers
a
cpu,
which
is
actually
what
we
might
call
a
hyper
thread.
So
our
amd
gpu
nodes
each
have
64
cores.
A
Each
core
has
two
hyper
threads,
so
so
linux
and
therefore
slim
sees
the
node
as
having
128
cpus.
So
here
that
c32
means
that
for
a
single
mpi
task,
you're
reserving
one
quarter
of
a
node
we'll
get
to
the
next
few
in
the
next
slide.
But
importantly,
particularly
for
today
is
we
have
a
reservation
called
pearl
mother
day.
One,
and
you
know
I
think
that
actually
should
be
two
dashes
in
reservation,
but
it
should
be
right
in
the
batch
file.
I
hope
so
when
you
submit
it
we'll
go
to
this
palmetto
day.
A
One
reservation
which
you
should
have
access
to
because
of
the
entrain
three
underscore
g
repo
that
you
that
everybody
is
in.
We also want to specify how many tasks per node. Our GPU nodes have got four GPUs per node, so I guess the simplest way to use things is to have one MPI task per GPU, which is to say four MPI tasks per node. So this sort of corresponds with that -c 32, which gave us a quarter of a node for each task.
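Put together, the batch header being described looks roughly like this; it's a sketch only, and the account and reservation names are placeholders, so check the provided batch script for the real values:

    #!/bin/bash
    #SBATCH -q regular              # the QOS
    #SBATCH -t 5                    # time limit in minutes
    #SBATCH -n 4                    # number of MPI tasks (lowercase n)
    #SBATCH -c 32                   # CPUs (hyperthreads) per task: a quarter of a 128-CPU node
    #SBATCH --ntasks-per-node=4     # one MPI task per GPU
    #SBATCH --gpus-per-task=1       # the GPU request, discussed next
    #SBATCH -A <gpu-enabled-account>
    #SBATCH --reservation=<training-reservation>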
So it looks like there are quite a few comments and questions in the chat, and it looks like Ronnie and Laurie are helping answer them very quickly, so thanks for that. All right, then we come to actually running the GPU code. So, skipping the top part of the batch script up here, it finishes up with pretty much an srun command.

This should be fairly familiar for those who've used Cori before. Just another thing that you'll see when we look at the examples in a moment is that several of the examples don't have --gpus-per-task, but instead they have dash capital G. With -G you're specifying the total number of GPUs for the job; so if you've got two nodes, with eight GPUs available, that's -G 8.
A
This
is
kind
of
a
handy
shorthand,
it's
shorter
type
than
gpus
per
task
and
good
for
when
you're,
just
using
one
or
two
nodes,
when
you
start
using
larger
numbers
of
nodes,
calculating
it
out
will
get
a
little
bit
unwieldy
and
you
know
it's
a
little
harder
for
documentation
in
terms
of
you've
got
to
calculate
it
out
to
work
out
for
nodes.
So
for
for
your
larger
scale,
real
jobs,
you
probably
want
to
switch
across
to
gpus
per
task,
but
that's
what
the
dash
g
means
in
the
examples
here.
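As a sketch of the two equivalent ways of asking for GPUs on an srun line (the executable name is a placeholder):

    srun -n 8 -c 32 -G 8 ./my_gpu_app                # total GPUs for the job: 2 nodes x 4 GPUs
    srun -n 8 -c 32 --gpus-per-task=1 ./my_gpu_app   # the same request expressed per task, which scales more readably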
A
So
an
easy,
easy
thing
to
admit
is:
if
you
don't
set
gpus
per
task,
then
what
you
can
get
is
actually
a
floating
point
error
and
the
re
it's
a
floating
point
error,
not
a
sequel,
and
the
reason
for
this
is
that
basically,
the
gpu
hasn't
been
allocated
to
the
job,
we're
trying
to
run
on
a
gpu.
It
doesn't
have
one
it
trips
over.
So
if
you
get
an
error,
first
thing
to
check
is
that
you
have
all
of
the
asbestos
directive
set.
A
Okay,
so
now,
let's
go
to
another
hands-on
period
in
your
clone
of
that
directory,
you
can
go
back
to
ex3
make
if
you
haven't
already
done
that
which
hopefully
should
have.
If
you
didn't
succeed
in
building
it
before
the
module,
show,
training
or
module
load
training
will
point
you
at
a
place
where
we
actually
have
a
pre-built
executable
that
you
can
use
you
can
copy
across.
A
A
Actually
more
than
caught
up
so
we'll
have
about
sort
of
five
or
ten
minutes
for
this,
and
I
think
what
comes
after
this
is
actually
a
break,
so
we
can.
A
So
we'll
move
along
to
the
next
step.
We
actually
have
two.
We
do
have
one
more
topic
and
exercise
in
this
session
before
moving
on
to
the
second
session,
so
we
are
still
slightly
behind,
but
not
too
far,
and
we
should
catch
up
in
the
next
one.
A
So
experienced
quarry
users
will
be
familiar
already
with
the
ideas
of
affinity
and
binding,
we'll
use
those
on
that
system
as
well.
So
different
cpu
cores
have
an
affinity,
which
is
to
say
a
closeness
to
certain
memory
and
caches,
and
you
can
bind
a
thread
or
a
process
to
particular
cause
to
make
sure
that
that
thread
stays
stays
on
a
core,
that's
close
to
its
data.
A
A
So
a
similar
concept
holds
for
perlmutter
as
well,
so
the
filament
gpu
knows
you
can
figure
it
in.
What's
called
nps4,
it
stands
for
pneuma
nodes
per
socket
four,
which
basically
means
that
each
each
socket
each
cpu
or
what
you
call
it
node,
I
guess,
is
arranged
so
that
certain
cores
are
closer
to
certain
gpus
there.
There
are
four
kind
of
pneuma
nodes
on
each
gpu
node
and
each
gpu
is
closest
to
one
of
them.
A
So
this
diagram,
here
kind
of
in
a
slightly
cartoonish
way
illustrates
that
the
ccd
is
sort
of
the
unit
that
holds
a
bunch
of
cores
in
amd's
epic
architecture.
It's
divided
here
into
four
quadrants.
There
is
a
certain
amount
of
memory,
that's
closest
to
each
quadrant
and
a
single
gpu,
that's
closest
to
each
quadrant.
A
So
where
this
starts
to
matter
is
when
you
are
arranging
your
job.
Yeah
you've
got
some
gpu
tasks.
Some
of
the
work
can
happen
on
cpus.
It's
spread
over
multiple
nodes,
so
you're
going
to
want
to
have
some
sort
of
control
over
this
there's,
actually
quite
a
lot
of
several
options
that
you
can
use
around
binding
and
the
ones
here
are
sort
of
a
good
place
to
start
with
sort
of
a
you're
reasonably
sensible
default.
A
A
That's
just
cpu
bind
equals
cause,
which
is
to
say
that
a
given
task
is
locked
to
certain
cause.
It
can
move
around
in
the
hyperthreads
on
those
cores,
but
it
has.
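As a minimal sketch of the kind of srun line this corresponds to (binding flag added to the options from before; the executable name is a placeholder):

    srun -n 4 -c 32 --cpu-bind=cores --gpus-per-task=1 ./my_gpu_app   # each task pinned to its own set of cores
    # GPU-side binding has its own options too, e.g. something like --gpu-bind=closest; see the docs link mentioned next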
A
This
link
here,
josh.net
docker
jobs
affinity,
has
more
information
about
the
gpu
binding
options
and
we
can
do
a
quick
hands-on
to
try
it
out
in
ex-5
of
that
same
repo
that
you've
been
working
in.
A
So
if
you
go
in
here
make
the
code
it's
the
same
executable,
but
these
two
batch
scripts
will
do
different
things
with
the
binding,
so
one
won't
do
anything
with
the
binding
and
the
other
will
bind
it
to
the
closest
gpu.
If
you
look
at
the
outputs
of
each
it
describes
in
terms
of
pci
identifiers
which
gpu
each
task
has
available
to
it.
So let's do the same thing, just to be able to see where people are at: if everybody can raise their hand on Zoom, and then, when you've been able to run the exercise and you've seen some interesting output from it, put your hand down; and when the number of raised hands gets reasonably small, or after a few minutes, we'll continue.

So, now that we're recording again, to recap what we've talked about so far this morning: we've built and run a simple C++ application using MPI with CUDA, using the compiler wrappers for the CPU/MPI-side code and nvcc for the CUDA code.

For the rest of this session we're going to go into a few more of the edge cases. A lot of people might be saying here: this is all very well for a simple training exercise, but my application is more complicated than that. So we'll go through a few other common scenarios that people are likely to hit, and what you can do in those cases.

Some topics for here are: what about things like BLAS, LAPACK, FFTW, etc., when you're using GPUs? And what about if the NVIDIA compiler isn't suitable for, or doesn't work for, your application?
So, GPU-accelerated math libraries in CUDA: there are GPU-accelerated implementations of, or alternatives to, a lot of the common math libraries. For instance BLAS, which is sort of at the bottom of everything — there's cuBLAS, which is a CUDA equivalent, and you get that when you module load cudatoolkit. LAPACK doesn't have a direct equivalent in the NVIDIA stack, but the NVIDIA stack does include cuSOLVER, which does similar things to a lot of the LAPACK routines and includes some of the LAPACK routines directly.

It doesn't have quite the same API, though, so you do need to write your code for it. But the good news is that with the NVIDIA compiler there's an option you can add, -nvlamath, and what that does is basically add a LAPACK-equivalent, or near-equivalent, interface on top of the cu libraries.
A
I
haven't
included
a
link
here,
but
if
you
do
a
search
for
nvla
math
on
our
docs,
you
should
find
something
fftw,
there's
a
ceo
fft,
which
is
a
cuda-oriented
fft
and
cofftw,
which
is
an
fftw
interface.
To
that
there's
also
cu
sparse,
which
does
yeah
some
sparse
solvers.
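As a rough sketch of pulling those in with the NVIDIA compilers once the cudatoolkit module is loaded (the source file name is hypothetical):

    CC -cuda -gpu=cc80 -cudalib=cublas,cufft -o solver solver.cpp   # -cudalib links the named cu libraries
    # or link them explicitly, e.g. with -lcublas -lcufft -lcusolver at the link step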
So those ones are part of the NVIDIA stack that you get with module load cudatoolkit. There are also a bunch of third-party math libraries that are GPU accelerated. Two that are probably particularly useful and important are MAGMA and SLATE, which between them cover a subset of LAPACK and ScaLAPACK, and here's the link that you'll want to follow in our docs to some tips about using these libraries.

We're not going to go into too much detail about those today, but just a reminder and a quick plug for an upcoming training that Helen mentioned in the welcome this morning. This is only next week, I think: we have some training from NVIDIA about their HPC SDK. It's a hands-on training that will cover, amongst other things, these GPU-accelerated math libraries. There's a link down here for registration and info, and these slides — if that's quicker than typing it in — are available from this training event's web page at the moment.
A
Next
scenario
is
what,
if
you're
not
using
the
nvidia
compiler,
so
so
we
recommend
the
nvidia
compiler,
for
you
know
as
the
as
the
first
approach
for
most
things
for
gpu
based
applications
on
perlmatter.
It's
it's,
the
one
that
has
the
best
support
for
the
gpu.
A
You
know
tool
chain,
it's
the
one
that
we've
sort
of
done,
the
most
with
in
terms
of
you,
know,
nick's
certain
preparation
on
on
filmmatter,
it's
the
default,
and
it's
the
one
that's
loaded
by
default
when
you
log
in
so
that's
kind
of
all
for
a
reason,
so
yeah
do
try
that
first,
however,
you
know
we
have.
We
have
four
different
compiler
stacks
and
they
all
have
different
strengths
and
weaknesses
and
yeah.
You
might
find
that
for
some
applications
you
do
hit
difficulties
with
programming
video.
A
A
It's
pretty
portable,
it's
available
in
everything
which
tends
to
mean
that
it
gets
bug,
fixes
and
features,
and
so
on
fairly
quickly.
So
that's
what
we
recommend
to
the
second
alternative.
We
also
have
on
the
system.
If
you
type
module
avail
program,
you'll
see,
we
have
a
there's:
a
cray
program,
programming,
environment
and
an
amd
programming
environment.
These
two
currently
are
more
cpu
oriented
than
gpu
oriented
and
we
haven't
done
too
much
with
them
just
yet.
A
Okay,
so
oops
a
couple
of
limitations
to
be
aware
of
for
different
compiler
stacks
when
you're
using
the
gnu
compiler,
you
need
to
choose
the
right
gcc
version
for
the
cuda
version
that
you're
using
now.
The
good
news
is
that
doing
the
default
shouldn't
just
work,
so
the
the
default
cuda
toolkit
is
actually
cuda,
toolkit
21.9,
underscore
11.4,
and
that
supports
the
default
gcc,
which
is
11.2.0,
but
if
you're
using
an
earlier
toolkit
version,
you'll
also
need
to
use
an
earlier
gcc
version.
A
The
gnu
compilers
that
we
have
installed
don't
currently
support
open
mp
and
open
acc
offloading
that
is
coming
soon.
I
think
that
it's
not
there
yet
also
the
handy
trick
of
having
cuda
code
embedded
in
your
source
files
is
specific
to
the
nvidia
compiler.
So
with
the
gnu
compiler,
you
need
to
have
the
cuda
code
in
its
own
separate.cu
file.
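So with PrgEnv-gnu the build splits into two steps, roughly like this (the file names are placeholders, and you may also need to point the linker at the toolkit's library directory):

    module load PrgEnv-gnu                 # switch from the default PrgEnv-nvidia
    module load cudatoolkit
    nvcc -arch=sm_80 -c kernel.cu          # the CUDA code, in its own .cu file, compiled by nvcc
    CC -c main.cpp                         # the CPU/MPI code, compiled by the GNU C++ wrapper
    CC -o app main.o kernel.o -lcudart     # link, pulling in the CUDA runtime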
A
This
is
another
compiler
stack
that
I
didn't
mention
before,
which
is
the
llvm
one.
Llvm
is
yeah
the
clang
and
flying
stack.
A
few
of
the
other
compiler
stacks
are
actually
based
on
it
and
coming
soon
we
we
have
plans
and
development
of
a
program
llvm
as
a
nisk
supported
program
based
on
the
llvm
compiler.
A
It's
not
there.
Yet
it's
not
far
off.
I
understand
it's
currently
targeting
c
and
c,
plus,
plus
only
it
doesn't
have
a
fortran
stack
in
there.
Yet
it
should
have
support
for
sickle
and
openmp
offload
and,
as
I
said,
it's
not
available
just
yet,
but
is
expected
soon.
A
The
limitation
we
currently
have
with
the
cray
compiler
is
that
it
supports
the
v100
gpus,
which
is
the
model
before
ours,
but
it
doesn't
yet
support
the
a100
gpus
so
to
use
that
the
a100
gpus
could
still
run
v100
code,
but
it's
not
going
to
be
as
as
optimized
and
you
need
to
load
a
different
creepy
excel
module
for
that
just
one
earlier.
A
Also,
we
really
haven't
spent
much
time
testing
this
and
so
nurse's
ability
to
support.
It
is
a
little
bit
more
limited
and
aacc
kind
of
has
a
similar
issue
there
in
that
nurse
hasn't
spent
any
real
time,
and
you
know
building
up
expertise
here.
So
our
ability
to
support
it
is
still
fairly
limited.
That
will
probably
be
you
know
more
of
a
focus
come
phase
two
when
we
have
cpu
oriented
nodes.
Also,
currently
the
aocc
compiler
doesn't
have
the
offloading
support.
A
Some
errors
that
you
might
see
when
using
different
programs.
A
If
you
see
something
about
floating
point
exception,
we
mentioned
this
before
check
that
you've
actually
requested
gpus
errors
about
bind
request,
not
specify,
does
not
specify
any
of
the
devices
within
the
allocation
complains
about
binding
check
if
you
actually
requested
all
of
the
gpus
in
the
node,
if
you
only
requested
half
of
the
gpus
and
it
tries
to
bind
to
the
closest
of
my
binds,
who
might
be
attempting
to
bind
to
one
that
isn't
actually,
you
know
marked
its
allocated
to
you
and
if
you
hit
a
cannot
open
shared
object
file,
this
can
happen
if
you
built
part
of
the
code
with
one
program
and
another
part
of
the
code
with
the
different
programs.
A
You
know
they're
trying
to
access
sort
of
different
versions
of
similar
libraries
and
things
can
kind
of
get
messy.
So
it's
a
good
idea
to
do
it
and
make
it
clean,
and
you
know
make
sure
that
the
object
files
are
successfully
deleted.
After
your
swap
programs.
A
So
for
our
next
hands-on
exercise,
let's
try
it
out
back
on
perlmutter,
we'll
use
that
exercise
4
and
exercise
5
again
see
if
you
can
build
them
with
program
gnu
you
might
need
to
make
a
few
changes
to.
You
know:
make
files
and
batch
files.
A
Just
to
note
that
this
is
viable
with
exercise
four
and
exercise
five
exercise:
three,
if
you
try
to
build
with
programming,
gnu
you'll
get
some
sort
of
curious,
looking
errors,
and
it's
because
that
in
exercise,
three
we've
got
the
cuda
and
the
c-plus
plus
code
all
merged
into
the
same
source
file,
and
that
feature
is
only
supported
by
programming
nvidia.
So
you
won't
won't
be
able
to
build
that
with
program
canoe,
but
it
can
be
interesting
to
get
have
a
try.
A
Look
at
what
the
error
messages
are
and
recognize
them
for
when
it
comes
to
working
with
your
own
code.
A
So
let's
spend
10
minutes
on
this.
A
Have
a
crack
at
building
these
exercise
4
and
exercise
5
with
programming,
canoe
and
we'll
do
the
same
thing
if
everybody
can
raise
a
hand
and
lower
it
when
they're
done,
and
that
should
to
give
us
a
bit
of
an
idea
of
how
how
many
of
us
have
finished
the
exercise
and
we'll
reconvene
in
five
or
ten
minutes.
A
So
continue
if
you
are
still
stuck
on
anything,
let's
post
a
question
in
the
google
doc
or
jump
into
the
breakout
room.
What it does is present the GPU device memory as part of the same address space as the CPU main memory. This is a diagram that comes from NVIDIA's website, illustrating what this means. The CPU has a certain amount of RAM attached to it, and each GPU has a certain amount of RAM inside it as well, and naively, pre-UVA, each of these is separate: it's a separate address space, and they can't really talk to each other. Having a single address space is more like this picture over here on the right, where the memory might be in physically different places, but it's arranged as a logically contiguous block of addresses. What this means is that a CUDA-aware MPI implementation — and Cray MPICH is one of these — can send and receive messages directly from the GPU memory of one node to the GPU memory of a GPU on a different node.

As opposed to separate address spaces, where what you would need to do is actually use a cudaMemcpy — device to host — to move the memory from the GPU into main memory, send it through MPI that way, and then move it from main memory on the other end back into the GPU with a host-to-device copy.

So obviously, having CUDA-aware MPI here, being able to transfer memory directly from GPU to GPU, can save a lot of buffer copying, particularly when most of the work that your application is doing is happening on the GPU.
A
So
how
do
you
know
if
you're
using
it
one
good
tip,
is
to
use
the
ldd
command
to
have
a
look
at
the
executable
this
for
a
dynamic
executable
which
is
now
the
default
on
kalamata,
for
building?
A
Another
thing
that
can
help
is
in
showing
which
libraries
you
are
using,
and
so,
if
you
run
ldd
on
your
executable
and
you
see,
one
of
the
libraries
in
here
is
called
libor
gtl
cuda.
That
means
that
you
have
cuda
aware
mpi
available
to
you.
So
gtl
stands
for
something
like
the
new
transport
layer.
I
think-
and
this
is
this-
is
craze
library
for
providing
gpu
to
gpu
memory
transfers.
We've
got
network
network
transfers
from
node
to
node
to
actually
make
use
of
this
cuda
aware
npi.
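For example (the executable name here is just a placeholder):

    ldd ./mpi_bcast | grep gtl   # if CUDA-aware MPI is linked in, libmpi_gtl_cuda should show up in the output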
So let's give it a try. Jump back into your Perlmutter window — you might need to go back a directory to see it — but there should be a directory called cuda aware mpi. Take a bit of a look at this. This is a different application again; it's just a very simple MPI broadcast. It has a buffer in GPU memory, and it will directly transfer it from one GPU to another GPU.

We don't need to worry too much about the source code at this point; I think the focus here is really on how you build and use it, but it might be interesting to look at the source code just to see what it's doing. Most importantly, build and run it, run ldd on it, and see if you can spot that libmpi_gtl_cuda. A couple of tips: don't forget you need to switch back to PrgEnv-nvidia.
I have missed a step here. Hopefully — if I remember rightly — I think in that exercise there was also a batch script that you can run, and in fact this output will probably be in the output from that batch script. The batch script has the setting of the environment variable as well.
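The variable isn't named aloud here, but for Cray MPICH the relevant setting is normally MPICH_GPU_SUPPORT_ENABLED, so the batch script presumably contains something along these lines:

    export MPICH_GPU_SUPPORT_ENABLED=1        # tell Cray MPICH to accept GPU (device) buffers in MPI calls
    srun -n 4 --gpus-per-task=1 ./mpi_bcast   # executable name is a placeholder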
A couple of other scenarios that you may be working with: maybe you're not using CUDA. There are a couple of offload options — OpenMP and OpenACC — and different applications that you're using may use one of these inside the application.

If it's OpenMP, it will look something like this: you'll have directives, like pragma omp target teams and so on, with some map clauses in there. OpenACC is a little bit similar: you'll have directives in the code that say things like pragma acc parallel loop. OpenACC, in a way, is a higher level of abstraction.

So if your application uses OpenMP, what you'll need to add at compile time are these C or C++ flags: -mp=gpu — mp for multiprocessing — and -gpu=cc80. That's that magic 80 number again, because we're using NVIDIA Ampere. And this last one, -Minfo, is optional but quite useful: it prints a bunch of information during compilation about what the compiler is doing with the OpenMP directives and the OpenMP offloaded kernels and loops.

Similarly, when you're building OpenACC codes, you'll add a couple of options to your C or C++ flags: -acc, and again -Minfo=accel.

The -Minfo does similar sorts of things — it prints a little bit of extra information — and -acc is for OpenACC. So the difference here is slight but noticeable: for OpenMP you have a -mp option, and for OpenACC you have a -acc option.
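As a sketch, the two compile lines with the NVIDIA compilers look roughly like this (the source file names are placeholders):

    CC -mp=gpu -gpu=cc80 -Minfo       -o app_omp offload_omp.cpp   # OpenMP target offload
    CC -acc    -gpu=cc80 -Minfo=accel -o app_acc offload_acc.cpp   # OpenACC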
A
I
think
we're
going
to
do
this
in.
I
know
we
have
build
and
run
it
so
we'll
go
back
to
hands-on
jump
back
into
your
clone
of
the
training
repo
and
you
might
need
to
go
back
a
directory
again
to
find
it.
But
you
should
see
a
directory
called
openmp
dash,
open,
acc.
A
Take
a
look
in
there.
Take
a
look
at
the
readme
if
you
like
it
the
code
and
build
and
run
it
and
take
a
look
at
the
output.
So the last couple of topics are about a few extra tips around building code. One that is very easy to trip up on — powerful, but also a little bit particular — is using CMake. For the most part it should just work. We have some CMake modules available on Perlmutter, and we have a fairly recent version, 3.22.

There are currently a few issues that we've discovered when linking math libraries in the CUDA stack — particularly things like cuFFT, cuFFTW, and cuSOLVER — which is that these libraries are in a different location from the nvcc compiler itself, and CMake often trips up trying to find them. We have a tip on this in our docs, but basically the tip is to add the math libs path to your CMAKE_PREFIX_PATH with something like this command.

So this /opt/nvidia/hpc_sdk path — you'll see that if you do module show cudatoolkit, it will show the specific path. It may be different from one cudatoolkit to the next, particularly when changing CUDA versions.
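So the command is something along these lines; the exact math_libs path depends on the CUDA version, which is why checking module show cudatoolkit matters, and this one is only illustrative:

    cmake -DCMAKE_PREFIX_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/<version>/math_libs ..   # point CMake at the CUDA math libraries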
There was a question earlier I saw in the Google doc about whether we should prefer CMake or autoconf. We don't particularly favor one over the other.

In a lot of cases you won't really have a choice: if you're using code that somebody else wrote — building a third-party library, for instance — chances are it already uses either CMake or autoconf to set up the makefiles, and so you just need to use what's there. If you're developing new code, probably CMake is the way to go, but particularly modern CMake. CMake has changed quite a lot over the years.

It's gotten better, basically, and the newer practices are generally much better to use, and more maintainable and sustainable, than the old ones. I think if you dig back through NERSC's training history and training resources, we did actually, a little while ago, have a course on using modern CMake, so it's worth taking a look at the slides and the recording of that if you're developing code.
Finally — this is still in progress; it's not quite there yet — we are working on setting up a Spack 0.17.0 module file and configuration. It's already on Cori, and it's actually there on Perlmutter too, but we don't have the module file yet; we're still testing and refining the configuration. But that should be there real soon.

It's being set up to work also with the E4S deployment. E4S — I've forgotten, actually, what it stands for, but it's part of the ECP project; it's a scientific software stack.

So we'll have that available on Perlmutter in the not-too-distant future as well, and their Spack instance is being set up to work with that, so that will hopefully make installing a lot of third-party software easier. We're not going to go into the details of how to do this today, but just a heads-up that it's coming.
A
I
seem
to
have
missed
putting
a
final
slide,
so
just
to
recap
we're
at
the
end
of
our
slides
and
exercises
for
today,
and
fortunately
we're
a
little
ahead
of
time.
A
A
What
to
do
if
you're
using
kudera,
aware
mpi
and
the
fact
that's
worth
using
a
few
pointers
towards
math
libraries,
some
tricks
and
trips
and
errors
that
might
trip
you
up
and
how
to
recognize
them
and
what
to
do
about
them.
And,
very
importantly,
there
is
the
repo
that
you
have
cloned.
A
That
will
hopefully
provide
some
examples
that
you
can
use
as
sort
of
a
starting
point
as
you
can
move
on
to
building
your
own
code
on
pearl
matter,
and
you
know
when
you
do
hit
errors,
that
these
examples
hopefully
will
help
to
narrow
down
the
steps
that
might
be
missing.
A
So
that's
all
that
we
have
for
today
I'd
like
to
think
notation
about
the
things
that
you
know.
I
provided
on
perlmutter
from
cray
and
also
helen
and
ronnie,
and
roll
and
moise
and
many
other
nurse
staff
who
have
been
answering
questions
in
the
chat
and
did
a
lot
in
developing
yeah.
These
slides
and
these
exercises.
A
So
and
of
course,
finally,
everybody
who
has
come
along
joined
their
training
and
participated
and
hopefully
found
it
beneficial
and
be
able
to
make
good
use
of
it
using
coal
miner
coming
out.