From YouTube: 03 Migrating from Cori to Perlmutter CPU Codes
Description
Part of the Migrating from Cori to Perlmutter Training, December 1, 2022.
Please see https://www.nersc.gov/users/training/events/migrating-from-cori-to-perlmutter-training-dec2022/ for the training day agenda and presentation slides.
So my name is Eric Palmer and I'm a software integration engineer here at NERSC, and I'm very happy to be here with you today to talk about migrating from Cori to Perlmutter. Today, the focus of my talk is CPU-only codes.
I aim in this talk to give you lots of information, or a general understanding, so you can do what you need to do with your CPU codes, but I'm not going to get super technical about maximizing ultimate performance. So hopefully you'll find this useful, but if you want to eke out every inch of performance in your application on Perlmutter, you're going to have to come back for more.
So these are the topics I'm looking to cover today, and I picked these four mostly because I see them as the major differences between Cori and Perlmutter. The module system is slightly different. The programming environments, meaning which compilers are available and what flags you need to get what you want, are slightly different. The way you compile codes is pretty similar, with one small point about a flag I'll mention. And the job scripts, because the architecture of the nodes has changed. Those are the things I'm highlighting in this talk.
So the first thing I'm going to talk about is modules, and your experience with Cori and modules is still valid on Perlmutter; it works largely the same. When you log on to a Perlmutter login node, you're going to have these modules loaded by default, and there are a few things to point out. One of these modules represents the CPU architecture that the Cray compiling programming environment, which I'll mention later, is going to use to optimize your code. Another one is the default programming environment, which again we'll get to more later: the GNU programming environment. But the third one is really important here if you're doing CPU-only code. The default right now is that the default modules load this gpu module, which enables the CUDA-aware MPI by default and also loads several modules that are targeted towards GPU codes. We recommend you essentially disable those by doing module load cpu if you're doing CPU-only codes. So the first step is: you come onto Perlmutter, you log in, you know you're doing CPU-only code, and you should look at doing that.
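For reference, a first CPU-only session might start like the minimal sketch below; the gpu and cpu module names come from the talk, and the rest is illustrative rather than an exact Perlmutter session.

    # On a Perlmutter login node, see what is loaded by default
    module list

    # For CPU-only codes, replace the GPU-oriented defaults with the CPU stack
    module load cpu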
The rest of the module stuff is still fairly similar to what you've seen on Cori. The commands here, such as module list, module load, module unload, and module swap, should be exactly the same. Of the two following commands, module show is one that I don't know that everyone knows; I feel like the others people use all the time, but maybe this one not so much. It will give you a lot of information about what the module is doing, and I have an example later to show you that.
Finally, the last one is going to be the focus of my next few slides, because I really want to drive this point home: if you're using module avail to find a module or find software on Perlmutter, you might not see everything right away, whereas if you use module spider, you're going to have access to more potential modules and more information for how to get to what you want right away.
So that's what's going on with module spider, and we'll talk more about that in a second. Finally, on this slide I put some tricks that I found useful and that I think may be helpful for other people.
If you prefer to just grep for a string through all the modules, you can use this line, where you redirect the module output to your favorite Linux utility. And instead of using module list, you can use the shortcut ml -t, which will print a nice vertical list of your modules.
So, as promised, this is all about module spider versus module avail. Now, they both still exist on Perlmutter, but they function slightly differently, and the reason is that the module system on Perlmutter is slightly different from the one on Cori. The module system on Perlmutter is called Lmod, where the one on Cori, I believe, was Tcl-based. The difference is that Lmod has a hierarchical structure. So if your module depends on another module being loaded before it can be loaded, module avail may not show you that you can load that module. That's why we have module spider, which will search regardless of that structure and basically give you more hits on any search. To illustrate that, I have an example where I'm trying to load, I believe it is, cray-netcdf.
We try to load it and we get an error. If you use the module show command instead, it's still unhappy.
If you want to load cray-netcdf, you have to load cray-hdf5 first, and if I do that, we find that the module loads, we get the software we want, and everyone's happy. So, in conclusion: module spider for the win. module avail still works and is still useful, but if you're looking for something and you're not finding it, please try module spider.
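Put together, the workflow from this example looks roughly like the following; the error behavior is paraphrased, not verbatim Lmod output.

    module load cray-netcdf    # fails: the module is hidden by the hierarchy
    module spider cray-netcdf  # explains which modules must be loaded first

    module load cray-hdf5      # satisfy the dependency...
    module load cray-netcdf    # ...and now the load succeeds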
I include this slide because I think it's helpful when you're trying to figure things out, especially with libraries: you want to make sure your library is linking, or you want to link a library to your application. It's really helpful to be able to use module show and see what loading a module does to your user environment. In particular, I would highlight the yellows and greens.
The green is where the environment variables are being set. So when you're making or building your program, if it's looking for the HDF5 directory, it's going to set that variable to that directory, and that tells you where it's looking for it. The yellows are changing your path, where it's going to be looking for a library. So if you're wondering, say I want to explicitly link and I'm looking for the location of that library, I can find it this way.
I get a lot of information from module show. Okay, so those were my major points on modules. The next thing I'm going to talk about is programming environments.
The three big programming environments on Perlmutter are the GNU programming environment, the NVIDIA programming environment, and the Cray programming environment. We no longer have a programming environment for Intel; I know that's been a pain point for a lot of people, so hopefully the information here today will make it less painful. The GNU programming environment, with the GCC compilers, the gfortran compiler and whatnot, is the one we typically recommend you try first for CPU-only codes.
If the GNU programming environment isn't working for you, it's really easy to switch to a different programming environment, such as the Cray one, and give Cray a shot. Sometimes if your code is not compiling, just switching from GNU to Cray makes it compile and work, and then you're good to go; both of them are equally valid.
So, for example, suppose I want to compile a C++ code and I have my compile line, CC plus the commands to compile my code; I'll just make it up for now, because we're going to cover it later. Well, if I'm in the GNU programming environment, Cray is going to automatically change that CC to the g++ compiler, the appropriate command for the g++ compiler.
So maybe I should use this one as an example, because I did this one yesterday and I know it: the wrapper changes this to the command we want, adds a bunch of stuff that we're going to see, and then it compiles your code. Now, rather than that, take the exact same compile line: if I switch to the NVIDIA programming environment and I use that exact same line,
it's going to use the NVIDIA nvc compiler to compile my code, as long as I'm using the wrapper, and it's going to make other necessary adjustments under the hood. So programming environments work really well in conjunction with these wrappers, and we recommend that you give them a shot and try them in this way.
As I mentioned, switching between programming environments can be useful for testing things and solving problems, and it's pretty straightforward. You don't have to use module swap or unload. For example, if I'm in the GNU programming environment and I want to go to the Cray programming environment, all I need to do is type module load PrgEnv-cray, and I'm there.
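A minimal sketch of that switch, using the standard PrgEnv module names; the --version check is only there to show that the same wrapper maps to a different underlying compiler.

    CC --version             # under PrgEnv-gnu this reports g++

    module load PrgEnv-cray  # no swap or unload needed
    CC --version             # the same wrapper now drives the Cray C++ compiler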
Okay, so this slide has a lot of stuff on it, but it's going to bring home the point I was sort of making about some of the benefits of using the Cray compiler wrappers.
What this is doing is comparing two different ways of compiling the same hello-world OpenMP code. So suppose I use the gcc command and I compile my code: what you're seeing here is everything that's going into the compile line. If I use the cc wrapper instead,
what I've done on this line is enable the flag -craype-verbose, which will show me all the stuff being put into the compile line when I use the wrapper to compile; normally that's hidden behind the cc. So what you will see on the command line, if you're using the wrappers, is just cc hello_world_openmp.c and so on to compile your code, and behind the scenes all these optimizations for the CPU architecture are added, the MPI libraries are included, and the Cray science libraries are linked in.
A
So
the
rappers,
like
I,
said
they
provide
a
lot
of
stuff
automatically
under
the
hood.
They
link
MPI
your
science,
libraries,
le
pack,
blast
scholar,
pack
and
more
just
automatically.
This note, I think, is important, because sometimes people have particular questions about the science libraries. The LibSci man page is a good way to get detailed information about how the science libraries work in a Cray programming environment.
If you have a build system such as CMake, you may need to explicitly tell it to use the wrappers with a line like this: you include this line to tell CMake these are the compilers I want to use, and it will take care of the rest. The same goes if you have the traditional configure, make, make install type of build.
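Common ways of pointing both kinds of build system at the wrappers; these exact invocations are an assumption on my part rather than lines read off the slide.

    # CMake: name the wrappers explicitly
    cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC -DCMAKE_Fortran_COMPILER=ftn ..

    # Autoconf-style configure: the equivalent via environment variables
    ./configure CC=cc CXX=CC FC=ftn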
On Perlmutter, the default is for libraries to link dynamically. What that means is, like I said, when you load that module into your environment, it prepends the path, so the system knows where to find it. So when I want to compile code with something like GSL and I'm using the wrapper, all I need to do is specify the package I'm linking; I don't have to give it the locations or the includes. That's all taken care of automatically, which is a convenient thing.
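A sketch of what that convenience looks like with GSL; the gsl module name and the single -lgsl flag are illustrative, assuming the loaded module has placed its paths in the environment as described.

    module load gsl
    cc my_code.c -lgsl -o my_code   # no -I or -L paths needed; the module provides them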
If you're compiling your own shared libraries, you can use the command shown here to essentially achieve the same result with these dynamically linked libraries. And by default, Cray will build these executables to be dynamically linked.
This slide summarizes some useful compilation flags. In particular, the highlighted blue line is something I want to mention: to enable OpenMP for your codes, you have to include the flag. My understanding is that on Cori that happened by default, but now you must explicitly include that flag with your compilation to get that capability. And finally, just some quick tips.
This is stuff we've encountered, especially if you're trying to compile older codes coming from Cori to Perlmutter; some quick tips for you. If you're working with a Fortran code and you find that it doesn't just compile like it did before, especially if you're coming from the Intel compiler to, say, the gfortran compiler, you can look to some compiler flags to basically alleviate some of those errors. In particular, I recommend the -std=legacy flag.
Another one that you hear a lot is -fallow-argument-mismatch. Its effect is included in the -std=legacy flag, but you can also use it separately to achieve the same result. For C++ you can take a similar path and look for things like the -fpermissive flag, which will make the compiler less strict.
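For reference, the flags from these slides on one line each; the source file names are placeholders.

    ftn -std=legacy               old_code.f   # accept legacy Fortran constructs
    ftn -fallow-argument-mismatch old_code.f   # narrower fix, also covered by -std=legacy
    CC  -fpermissive              old_code.cpp # make g++ less strict about old C++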
Finally, just to mention: I said the big three, but there are some other programming environments and compilers that may be a little bit harder to find, so I highlight them here. For example, we have the Clang compiler available under the programming environment llvm. It's not as full-featured; it doesn't use the compiler wrappers. The way you access it is, first you have to make these module files visible by using module use.
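The access pattern is module use followed by a normal load; the path below is a hypothetical placeholder, since the talk does not spell out the real one.

    module use /path/to/llvm/modulefiles  # hypothetical location of the extra module files
    module avail llvm                     # the llvm modules are now visible
    module load llvm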
Okay, just to make this totally crystal clear: if you have a code on Cori and you want to run it on Perlmutter, you probably should recompile it. I mean, I imagine maybe it could run as-is, but I'd be surprised. Take your source code, move it over when you're on Perlmutter, recompile it, and then start doing your runs.
That's the way forward. So what I'm going to do now is show you some examples of just how to compile codes on Perlmutter. My example code is just a hello world that has both MPI and OpenMP built into it. The details of this aren't really important; just know that it's a simple code that does these things.
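The source itself is not shown in the talk, so here is a minimal stand-in with the same shape, MPI plus OpenMP, written from the shell so the sketch stays copy-pasteable.

    cat > hello_world_hybrid.c <<'EOF'
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        #pragma omp parallel
        printf("Hello from rank %d, thread %d\n", rank, omp_get_thread_num());
        MPI_Finalize();
        return 0;
    }
    EOF

    # The wrapper links MPI automatically; OpenMP still needs its flag
    cc -fopenmp hello_world_hybrid.c -o hello_world_hybrid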
All right, so that's my example. These are the modules I have loaded; like we said before, you can see I'm in the GNU programming environment. So when I run this command, it's going to use the GNU compiler to compile my code. Because I want to enable OpenMP, I have to include that flag; it will not be included by default, so you must include it. Now I'm specifying the environment variables for OpenMP, and OMP_PROC_BIND,
which used to be true, should now be set to spread; we'll talk about that later, but I just have to point it out here. I'm also on an interactive node, not running on a login node, in case you're wondering. So the takeaway from this short example is: if you were using the wrappers before and you're using them now, compiling on Cori and compiling on Perlmutter isn't that different. If you're using the compiler wrappers, it should be mostly the same, and it should, like we say, just work.
What I've done here is: I have a software package that I manually installed in my user space, and I want to link against the libraries that it provides. This is kind of a more manual approach, but I think it's worth looking at, because from my experience there's more than one or two users who want to manually link to their own libraries.
So this is what this example is showing you. I'm trying to compile my example, this hypre_exe, which requires a HYPRE_utilities.h file that's included in the HYPRE package I've already downloaded and installed in a different location. Because the path to that location is kind of long, I save it as an environment variable like this, HYPRE_DIR, and then I'm going to use that to access the files, but also in my compile line.
So this is just showing you how I can use that environment variable to add to commands and use it in other ways.
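A sketch of the pattern, assuming HYPRE was installed under a prefix in user space; the install path and the -lHYPRE spelling are illustrative assumptions.

    export HYPRE_DIR=$HOME/software/hypre    # hypothetical install prefix

    ls $HYPRE_DIR/include/HYPRE_utilities.h  # the header the example needs
    cc -I$HYPRE_DIR/include -L$HYPRE_DIR/lib hypre_test.c -lHYPRE -o hypre_exe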
So, the next section I spent quite a bit of time on, and it's kind of about developing, you know, not 100% certainty, but some feeling or instinct about whether what you're doing seems reasonable. To do that, I really have to talk a lot about the architecture of the Perlmutter CPU node and the way the memory is set up. That's what the next section is going to discuss. I'm starting here from the job script, because this is how I want you to approach
it: I want to translate the commands and the parameters that I select here over to these ideas. All right, so I listed these key terms here: node, MPI task, logical CPU, thread, physical core, processor, and NUMA domain. All of these are going to relate back to the parameters and things you do here. So let's go for it.
The first thing you're going to encounter is that the terms used for things are not always the same; they're sometimes the same and sometimes different. So if you pull up the Perlmutter system architecture page and what it says about a Perlmutter CPU node, and you compare it against other places in the NERSC documentation, some places will call it CPUs.
Where it says CPU on this one, I'm going to be talking about the processor. When I talk about a processor, I'm talking about the chip that you see on the motherboard. When I talk about physical cores, I'm talking about how that processor, that chip you see on the motherboard, is split up into smaller computational units. And inside each one of those physical cores, we have what I'm going to call logical CPUs, which is when we start talking about things like hyperthreads.
Hardware threads is another way to think about this, and there are other terms that have been used to describe it too. On Perlmutter CPUs, there are two logical CPUs per physical core. And again, just to point out, you'll also see the word socket, but here I'm going to be using the word processor to refer to the chip, essentially the chip that goes in the socket. So by adopting these terms and keeping them consistent through the next couple of slides, I hope that helps keep these concepts clear.
So if you were walking down the street and you ran into a Perlmutter CPU compute node, would you know what it was? Would you know what it looks like? Well, here's a nice picture. Okay, so if you were walking down the street and you ran into this thing: this is the Perlmutter CPU compute node, this diagram here on the right.
What I want to do now is relate these terms to parts of this diagram. So the first thing is the node. The node, I'm going to say, is the big outer square that includes both this yellow box and this yellow box, which represent the processors. So in each Perlmutter CPU node you have two AMD Milan processors, and we're going to count from zero: zero and one, so zero is here and one is here. Inside, each one of these processors has 64 physical cores.
That's the lines you see here and whatnot. Next to each group of 16 of them, they have their own memory; that will come into play later and we'll discuss it more. But within each one of these physical cores you have two logical CPUs for doing what they call hyperthreading. So you get two logical CPUs.
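Adding that up gives the numbers used for the job-script math later in the talk.

    # Perlmutter CPU node totals:
    #   2 AMD Milan processors x 64 physical cores   = 128 physical cores per node
    #   128 physical cores     x 2 logical CPUs each = 256 logical CPUs per node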
That's the diagram for the Perlmutter CPU node, and those are the terms. To give you a sense of how those relate, I'm going to give you this office building analogy. Hopefully this helps you, maybe not immediately, but later, when we start thinking about things, it's going to be useful to be thinking about which parts of the architecture we're talking about and how they affect things. So bear with me.
You can think of one floor of the office building as having two office floor plans, one representing each processor. Inside, your office floor plan is made up of little cubicles; each square is a little cubicle here. And in case you're following along, the mystery question of the day is: which cubicle represents which system?
There's one NERSC system that would have a four-person cubicle, and only one; so which NERSC system node has a four-person cubicle? The cubicles on the Perlmutter CPU are two-person cubicles. Inside a two-person cubicle, any little box like this, you have two people working at their stations, and those are the logical CPUs, or the hardware threads. So the cubicles are the physical cores, and these are the logical CPUs that are doing the work within them.
All those physical cores come together to be the processor, and we'll bring this point home more later. But physical cores that are closer together usually find it easier to communicate: if I'm a physical core working here and I have to work with the office people over there, that might take longer or might not be as efficient. That's when we start getting into the NUMA domains, which we'll talk more about. Okay.
So this is to highlight where we are now. If I say -N 2, I'm talking about nodes, and you now have a sense of what I'm talking about when I say a node. Now, if I say -c 16, those are the logical CPUs: the workers inside your cubicle, inside your office plan, inside your office building. So you have a clear sense of what this is talking about, what these numbers mean, and how they correspond to the hardware on that node.
That's why I highlight these. And then again here, when we talk about the --cpu-bind setting, when I say cores here, you know that this is relating to the physical cores, the cubicles that we talked about. So now you have a sense of what this word core means in association with the hardware.
So I'm asking a lot, but bear with me again for my cargo analogy. MPI tasks and threads are about how you split up your work. The first step is taking your simulation code and, if you use MPI in your code, you're breaking up the work,
all this stuff in the back of the truck, into smaller blocks. In particular, I'm thinking of this picture as representing, counting them off, fifteen MPI tasks. I really should have put a sixteenth here;
it would make me feel a lot better, but that's okay. Each one of these pallets of boxes you can think of as one MPI task in this analogy, and each MPI task, this pallet of little boxes, you can further break up into OpenMP threads. So using MPI tasks and OpenMP threads is a way to break down your work into smaller pieces, from one stage to a lower stage.
Those are the pallets of boxes: you've taken that truck full of work, and that's the actual number of pallets in the back of the truck; that's what the 32 corresponds to. The OpenMP number of threads is how many boxes are on each pallet. So when we talk about a thread now, this is the piece we're talking about, and now you have an intuitive sense of what piece of the work of your simulation that's relating to.
To get the rest of the terms, we have to understand NUMA domains, and if you're like me, this term may not have come as easily as the other ones. So what is a NUMA domain?
NUMA stands for non-uniform memory access, and essentially it goes back to this idea: if I have my physical cores computing work on my data, I have some memory which keeps that data really close, so I can do really fast work. But if I have to get the data from over here to work on, I have to do this communication step, where the communication comes through here, and then I can get it and then I can do the work.
So if I've got this person working on this one and talking to that one to work on that one, you can see that bouncing back and forth would make it a lot slower than if they could work right next to each other and didn't have to exchange the data from one memory bank to another. The takeaway here is: it matters where on the processor you're doing the work if you want to achieve maximum performance. If you're closer, you get better performance; that's the whole point of NUMA to me.
So now let's go back to our diagram of the Perlmutter CPU, I should say the Perlmutter CPU node. If we look inside each yellow box, each processor, they're split up into four NUMA domains. That means each Perlmutter CPU node has a total of eight NUMA domains on it. So when we set some of these commands to assign where the work is going to go on the hardware, you're going to want to be aware of these eight different NUMA domains.
There is a way to get that information in a detailed way. If you are on one of those Perlmutter compute nodes, you can run the command numactl -H, and you'll see that there are, it says, eight nodes, but these are the eight NUMA domains, labeled from zero to seven. It will tell you the physical cores; if you're counting only the red numbers, you'll get up to 128, starting from zero to 127: 128 physical cores.
If you also include the logical CPUs, that's where the black numbers come from. So in this one NUMA domain you have 16 physical cores, which include all of these different logical CPUs.
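For orientation, numactl -H output on such a node looks roughly like the sketch below; the CPU lists are condensed, not captured output.

    numactl -H
    # available: 8 nodes (0-7)       <- these "nodes" are the 8 NUMA domains
    # node 0 cpus: 0 1 ... 15 128 129 ... 143  <- 16 physical cores + their hyperthreads
    # node 0 size: ...
    # node distances: ...            <- the relative-cost table discussed next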
It reports only 12 units of distance between these two, and you can consider that as, like, a distance in time: 12 units of time. Whereas if I'm in one domain and I'm talking all the way over to NUMA domains that live on the other processor, that time can increase to almost three times as much. So the NUMA domains on my own processor can be very, very quick.
Because, you know, we're talking about a three-times difference in performance, we provide multiple tools so you can verify that the affinity is working the way you want. We have these pre-compiled binaries; you can run them with this command, and they will print out the information of where your ranks are and what the affinity settings are for each one.
The output gives you exactly the information, in the formats you want, so that you can read these things and make sure you're getting the thread placement you need. With that said, I think most people are going to rely on the NERSC defaults, and you're going to get pretty good performance with them as long as you use them correctly.
So now I'm going to start telling you what the suggested way is to run, to make sure you don't incur NUMA performance penalties and that your code runs well. Here are the general rules of thumb we suggest at the center: if the number of MPI tasks on a node is less than the number of physical cores in that node, then you should be including this flag, --cpu-bind=cores.
In my experience, this is almost all the time; if I had to guess right now, I want to say 90% of the time your number of MPI tasks is less than the number of physical cores. If you're in the much less common situation where your number of MPI tasks is greater than the number of physical cores, then you're going to want to set this to --cpu-bind=threads. I'm going to leave it at that, and we can talk in more detail if you want to follow up and understand these things deeply; I'm happy to chat more later.
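Condensed into one rule of thumb for a Perlmutter CPU node with its 128 physical cores:

    # MPI tasks per node <= 128  ->  srun --cpu-bind=cores   ...   (the ~90% case)
    # MPI tasks per node  > 128  ->  srun --cpu-bind=threads ...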
One of the consequences of these NUMA domains is: if you're running a hybrid MPI/OpenMP code, you want to use at least eight MPI tasks per node. That way, when your work gets split up across those NUMA domains, your OpenMP threads are close enough to each other that they can work quickly.
Okay, you can imagine that if you had only one MPI task for the entire node, it might put one OpenMP thread way over on this side of one processor and another way over on the other side of the other processor, so that the communication between them would be really slow; whereas if you add more MPI tasks, you would avoid that situation. Another rule of thumb is that the value of -c, the number of logical CPUs per task, should be greater than the number of OpenMP threads. And again, for placing things correctly, to tell the job scheduler where to put the stuff in the right places, we recommend you always set OMP_PROC_BIND to spread and OMP_PLACES to threads. There are some smaller edge cases where you might choose something different, but for most people, most of the time, this is probably going to give them most of
the performance they're looking for. The only other thing to point out here is that previously we recommended OMP_PROC_BIND=true, but we've found that spread is probably a better option in general on Perlmutter CPUs.
So this corresponds to these last parts of the job script: the OMP_PLACES, the --cpu-bind=cores. And I can say now, when you look at a job script like this, you should have some sense or feeling about where these terms are coming from and how you're setting them. This chart gives you the differences between some of the nodes that you know versus the ones on Perlmutter, how these numbers break down, and how you can use them when you're making those decisions.
This is just to highlight that this information is here. And again, this formula will always work as well; just for myself personally, I find it helps to have that intuitive understanding.
So I've got an example of an MPI-only job script here, one from Cori Haswell, and I'm including the best practice for MPI-only: we're not doing any OpenMP threads, but I include this line as the best practice to make sure that's always clear. What I have here is a script where I'm using 40 nodes with 1280 MPI tasks on Cori Haswell, but I want to write a job script that runs this efficiently on Perlmutter CPU, and one way I can do
that is: I divide by the number of MPI tasks, and that leaves me with two logical CPUs for each MPI task, and that's why I get the -c 2. The number here, especially, is going to be different: I take my 1280 MPI tasks and divide them across 10 nodes, so now I have 128 MPI tasks for each Perlmutter CPU node, and each Perlmutter CPU node has 256 logical CPUs.
So when I do that division, I get two logical CPUs for each MPI task, and that's what I put here. Again, because 90% of the time we're still in the earlier scenario, that's what's included here.
Another option is that I could have kept the nodes at 40 and changed things; now I'm basically summoning more hardware to solve this problem, and the way that affects my job script is that the -c changes accordingly. I'm doing the same math: I have my 1280 MPI tasks, which I'm dividing by the number of nodes; that leaves 32 MPI tasks per node now. But because I have 256 logical CPUs, I can give eight logical CPUs to each MPI task.
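The arithmetic for the 40-node variant:

    # 1280 MPI tasks / 40 nodes            = 32 tasks per node
    # 256 logical CPUs / 32 tasks per node = 8 logical CPUs per task
    srun -n 1280 -c 8 --cpu-bind=cores ./my_app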
So this is the test. It may look simple at first, but the curveball is that I'm also adding OpenMP; and the second curveball, since we're running late and I'm going to wrap up real fast, is that it doesn't make a lot of difference. Here's my hint: this is the math that I explained in the previous slides to calculate this number. It's doing the same thing
I just described: I'm doing tasks divided by nodes, then logical CPUs divided by the MPI tasks on each node. And then step three is something I do because I'm also using OpenMP threads: I want to make sure that the number of logical CPUs is greater than, I should say greater than or equal to, the number of threads that I'm assigning to each MPI task here. So 32 is bigger than eight, so I am good, and this script should run pretty well.
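The three steps written out, with the check using the numbers from this example:

    # step 1: tasks per node = total MPI tasks / nodes
    # step 2: -c             = 256 logical CPUs / tasks per node
    # step 3 (hybrid only):  require  -c >= OMP_NUM_THREADS
    #         here -c = 32 and OMP_NUM_THREADS = 8, and 32 >= 8, so it is fine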
There you go. This slide is to highlight the difference between an MPI-only run and a hybrid MPI/OpenMP run, and you'll notice the settings for -c and --cpu-bind=cores haven't changed; the only changes here are the OpenMP environment variables I'm setting. The last thing to point out is that we have a job script generator, which uses sort of a drop-down approach, where you can answer these questions, select answers, and it will automatically generate a job script.
This is a good place to learn and a good place to start. It may not cover every single edge case, but now you understand the hardware behind it and why these numbers are what they are, so if something comes out not perfect, you can fix it. And just to point out here, we're going to update this very shortly; as soon as we do, you will see that anywhere OMP_PROC_BIND says true, it will now be equal to spread.
So, just to note that here. With that, I am going to stop. These are my key suggestions: if you are going from Cori to Perlmutter, use module spider for a comprehensive module search; recompile your Cori codes on Perlmutter; start with the GNU programming environment, then move on to try the Cray one or some of the others.
[Helen] Eric, do you think you could just go over the hands-on slides? Because the GPU talk has its hands-on routine, so let's do the CPU one here as well.
[Eric] One slide, okay, yeah, right. So later on we're going to have a hands-on section. It's going to include these CPU-only codes that you can play with, that you can work with, and you can try out some of these concepts on, and NERSC staff will be around to guide you through that and help you with those examples. Those examples are nice because it's good to start with something that isn't the most complicated and build your way up. So, Helen, do you want to say anything else about this slide before I move on?
[Helen] So yeah, we have a few slides, and there's a readme.first that suggests you do the exercises in this order. We have an MPI hello world example, a hybrid MPI/OpenMP one with C and Fortran, and also an affinity example. The affinity example is as Eric has presented: you will see all the results, all these affinity values, and you can compare them against the diagram, the physical core and logical CPU numbers, to see where your processes spread or bind to. And there's another example using a package available from the E4S software stack.