From YouTube: Migrating from Cori to Perlmutter: CPU Codes
Description
Migrating from Cori to Perlmutter: CPU Codes
Presenter: Erik Palmer, User Engagement Group
Training: Migrating from Cori to Perlmutter, March 10, 2023
Right, so this is my outline. You can broadly break this talk up into five sections, but there's quite a bit of overlap between the different parts of it.
So here we go. I begin by talking about modules, which are essentially how you access pre-installed software on the system, and the big difference between Cori and Perlmutter is that we have a new module system on Perlmutter.
So when you first log on to Perlmutter, there will be a set of modules loaded by default. This is the environment configuration, the software that's already loaded on your system when you start up, if you haven't made any other modifications. It includes things like optimizations for the CPU architecture; that's what you see here in yellow. It also includes the GNU programming environment, which includes the GCC compiler.
Those are the ones highlighted in red, and because Perlmutter is a GPU system, we have by default a lot of modules geared for GPUs. So if you're running a CPU code, one of the first things you're going to want to do is type module load cpu, which will reconfigure this environment for CPU work, and you can see basically what it does.
It unloads a lot of the GPU-specific modules from your environment, and it will also turn off CUDA-aware MPI, which may cause problems if you're trying to compile a CPU-only code later and don't want it. Largely, however, the module system works similarly to what you're used to on Cori: module list is the same as before, and module load and module unload are still how you load packages into your environment and unload them from it.
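As a rough sketch, that first-login workflow for CPU work might look like this (the cray-fftw package here is just an illustrative module name):

```shell
# Reconfigure the default, GPU-oriented environment for CPU work
module load cpu

# The day-to-day commands work the same as on Cori
module list              # show what is currently loaded
module load cray-fftw    # load a package into the environment
module unload cray-fftw  # remove it again
```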
The big difference here is going to be module spider, and I want to point this out because this is also going to be a common one that you use: it's how you find software, specific packages and so on, that you want to load into your environment. Previously on Cori you would have used module avail, and module avail will still work on Perlmutter.
It just won't show you everything, and I'm going to show you an example that drives that point home in the next slides. For now, the commonly used module commands that you're going to be using on Perlmutter are these ones here. module swap still works and is still useful. module show I think is really useful, and I'll do another slide on that later; it will tell you what is going on when you load a module into your environment. And I have two cool tricks.
I'll just briefly describe this one, and I'll let you try it on your own next time you're on the system. Essentially, what it's going to do is take the module spider command and pipe it to grep. To make that clean, you can use this redirect flag, and that's the --redirect, and then there's one more flag for the spider command.
You can tell spider to search by regular expression with this -r flag, and that's why I just use a dot, which basically tells it to search for every single thing in the module system and pipe the output to grep; then you can use grep to search for the string you want. So this is handy if you're more familiar or more comfortable using your own text-searching tools.
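Put together, the trick looks roughly like this; searching for "netcdf" is just an example pattern:

```shell
# --redirect sends Lmod's output to stdout so it can be piped;
# spider -r . matches every module name as a regular expression.
module --redirect spider -r . | grep -i netcdf
```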
You can do it with that cool trick, and for more information on this, you can look at the docs for the Lmod environment. So what is the difference between module spider and module avail on Perlmutter? If you use module spider on Perlmutter, it will search without regard for hierarchy.
On Perlmutter, the module files are arranged in such a way that if you're looking for a module with module avail, it will not be shown to you unless you have all the dependencies already loaded. So if you're typing module avail, then unless all the dependencies of a module are already loaded on your system, module avail will not show you that module as being available.
So that's why you're going to see a difference in the output between module spider and module avail.
So in this example, what I'm going to do is just search for cray-netcdf; that's the name of the module I want, and this is going to show you the difference as I go through this process. So bear with me as it types what I just said. These are the modules I currently have loaded, and just to point out that it's not there, I'm going to try to just load it. It's just not available, is what it says.
Well, if I use module spider cray-netcdf, all of a sudden I've discovered it. Now, with module spider, if I type out the module name and the version, I get even more detailed information, and that's where it tells me that I need to load cray-hdf5 first if I want to load cray-netcdf.
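The sequence from the slides is roughly the following (the version string is left as a placeholder; use one that module spider actually reports):

```shell
module load cray-netcdf              # fails: not visible in the current hierarchy
module spider cray-netcdf            # finds it and lists the available versions
module spider cray-netcdf/<version>  # reports that cray-hdf5 must be loaded first
module load cray-hdf5 cray-netcdf    # now the load succeeds
```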
We see that the module is now loaded here. So this example, which we'll leave with the slides so you can view it as many times as you like, shows you that module spider, our Spider-Man, defeats our Superman hero, module avail, in this case of searching for cray-netcdf. So we're really going to recommend that users change that default habit and move toward module spider.
Another useful module command is module show. This slide is a lot of text, but what I wanted you to see is that module show gives you a lot of information about what the module is doing when you type module load for a particular package. It's essentially doing two large categories of things: changing your paths and setting some environment variables.
The things in yellow are what I've highlighted as changes to paths; you can see it's adding the place where these cray-hdf5 libraries exist down here. And with the HDF5 directory and HDF5 root, it's setting those environment variables so that when your applications are looking for HDF5, they might search for that environment variable to try to find where it's located, and that's where this is being done. The other important reason to show you this is that sometimes people, as part of building
their code, have these library paths hard-coded, and it might be helpful just to be able to look at the location of the library; that might help you troubleshoot, or take a shortcut to solving some of your compile issues, if you need a specific library and you're trying to find it.
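A minimal way to do that inspection, using the same cray-hdf5 example:

```shell
# module show prints what loading would do: prepend_path() calls for
# PATH / LD_LIBRARY_PATH and setenv() calls your build may rely on,
# including where the library is installed.
module show cray-hdf5
```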
Okay, so in the next section I'll talk a little bit about programming environments, and we'll get a little more into compilers and the things that go along with them.
I'm going to focus on three main programming environments on Perlmutter. The default one is PrgEnv-gnu, the GNU programming environment. The big new one is PrgEnv-nvidia; for us, when we're talking about CPU-only codes, I expect that we may not use that one as much.
However, if you're starting from CPUs and moving to GPUs, you might focus on that one more, and PrgEnv-nvidia is also totally capable of compiling CPU-only code, so it's definitely worth a try if you're trying to get stuff to work. PrgEnv-cray, which uses the Cray compilers and is listed here, is another viable option. We've set PrgEnv-gnu as the default, and we usually suggest that people start with it.
If it's your first time trying to compile your code on Perlmutter, we suggest you start with PrgEnv-gnu and then branch out to the other ones, to see whether you get better performance or whether you're able to compile in places where you couldn't compile before. That's our general advice.
One thing I'm going to be emphasizing in the next few slides is wrappers. The wrapper works in tandem with the programming environment. As you can see here, to use the g++ compiler for C++, I'm going to use this capital CC wrapper: if I just type CC in PrgEnv-gnu, it will call the g++ compiler.
If I was in PrgEnv-nvidia and I used the CC wrapper, it would call the nvc++ compiler. That's what this table is showing you: how the compiler behind each wrapper changes as you change programming environments. And for MPI, each one of these uses the Cray MPICH MPI, the default that we recommend.
So how do you load a programming environment? Just like the other modules: module load PrgEnv-gnu, for example.
If you want to switch from one to another, for example if I'm going from gnu to cray, I can type module load PrgEnv-cray; I don't necessarily have to swap or unload like maybe previously, so that is slightly more convenient. But like I said, the programming environments work in tandem with the compiler wrappers, and I want to continue to encourage you to use the compiler wrappers; that's what this slide is talking about.
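As a sketch of that table, the same wrapper names resolve to different compilers depending on which PrgEnv module is loaded:

```shell
module load PrgEnv-gnu     # cc -> gcc,  CC -> g++,    ftn -> gfortran
module load PrgEnv-nvidia  # cc -> nvc,  CC -> nvc++,  ftn -> nvfortran
module load PrgEnv-cray    # cc/CC/ftn -> the Cray (CCE) compilers
CC --version               # reports whichever C++ compiler is behind the wrapper
```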
This slide is showing you that not only does the compiler wrapper automatically set the compiler based on your programming environment, but it also includes many other things that you don't necessarily see. So, starting with this dark blue line here:
I have a typical compile line that I would run with the GCC compiler; I'm compiling a hello-world example, with a few flags to give it OpenMP and to tell it how to output my code. In the second line down here, in the light blue area, I'm using the compiler wrapper, and I've also added this flag, -craype-verbose, which is basically going to tell me all the extra things that are happening behind the scenes when I type this compile line.
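The two compile lines from the slide look roughly like this (hello.cpp is a stand-in for the actual example file):

```shell
# Plain GCC compile line
g++ -fopenmp -o hello hello.cpp

# Same build through the Cray wrapper; -craype-verbose echoes the full
# underlying command, including every flag the wrapper adds for you
CC -craype-verbose -fopenmp -o hello hello.cpp
```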
So the only difference between these two is that I've added this flag to say: hey, tell me all the extra flags that are being added by the CrayPE compiler wrappers, and tell me a little bit about all the amazing things that are happening. That's what you see down here. Once you show all that stuff: in this example I'm in the gnu programming environment, so this Cray compiler wrapper is calling GCC, and it adds a flag to optimize for the CPU architecture.
It also has additional flags to further optimize for the architecture; it is including our defaults, and it's also including the science libraries and several other things that we will find helpful. This list goes on quite a bit, but it's kind of too big for one slide, so let me cut it off here.
Furthermore, if you're using the wrappers, several things will automatically link: MPI, like I showed you, and the science libraries, LAPACK, BLAS, ScaLAPACK, and more. If you have Cray modules loaded, those get automatically linked by the wrappers. A quick side note about the scientific libraries: if you're looking for LAPACK, ScaLAPACK, BLAS, and the like, those are included in the cray-libsci module, and for more information about exactly what's in there and how to use it,
I recommend looking at the manual with the man command; it'll tell you more about that. Modules are linked dynamically by default, so when you load these modules into your environment, a lot of the time the paths will be prepended to the LD_LIBRARY_PATH or to the Cray LD_LIBRARY_PATH, and the shared libraries will be dynamically linked. If you're combining with your own shared libraries, consider adding the options for the rpath in general.
You should know that the Cray wrappers build dynamically linked executables by default. One thing to be careful about is if you were using the -static flag, or this Cray environment variable down here, CRAYPE_LINK_TYPE=static.
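As a sketch, the two ways of requesting static linking mentioned here are:

```shell
# The wrappers link dynamically by default; either of these flips that:
CC -static -o hello hello.cpp    # per-invocation flag
export CRAYPE_LINK_TYPE=static   # affects all subsequent wrapper builds
```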
So, just to sum up compilers and flags: there are a lot of them, and it's a little bit daunting to just keep flogging you with more and more flags, but I've only got a few more. This table sort of puts those things together. I've separated them out: this is the gnu programming environment, this is the cray programming environment, this is the Nvidia programming environment, and I've just pulled out a few common compiler flags that you might want to use with your codes.
Typically, you can see that cray and gnu behave almost identically. However, if you're compiling codes in the Nvidia programming environment, you may need to make some small changes to achieve the same things.
One thing to point out here, a big difference between Cori and Perlmutter, is that OpenMP is not enabled by default on Perlmutter.
So if you want your codes to incorporate OpenMP, you need to add the flags. The flag for gnu is just -fopenmp, but on Nvidia that flag is different: it's -mp. And again, I will tell you, from my personal preference, that to get into the nitty-gritty details of the flags on the compilers, the manual pages are really helpful, and searching through those can be a quick way to get definitive information about what you need.
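For example, the same OpenMP build in the two environments (hello.cpp is a placeholder source file):

```shell
CC -fopenmp -o hello hello.cpp   # PrgEnv-gnu: GCC-style flag
CC -mp -o hello hello.cpp        # PrgEnv-nvidia: NVIDIA-style flag
```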
Another thing to point out: in my experience, a lot of people coming from Cori to Perlmutter are usually bringing codes that may be a few years old by now, and one of the big differences between Cori and Perlmutter is that we don't have the Intel programming environment and the Intel compilers that go with it, at least at this moment. So when you were compiling a code on Cori,
if you were compiling with the Intel compilers, you might find it doesn't automatically compile with the GNU compiler on Perlmutter, and so we have some tips here to kind of help you with things like that; that's what's on this slide. So for Fortran, especially older Fortran, one of the flags we recommend you try, if you're having trouble compiling when you move to Perlmutter, is this:
-fallow-argument-mismatch. This is a fairly specific, targeted thing in terms of what it's telling the compiler to ignore. But to take that farther, in the "ignore everything you can" direction, for older code practices that may no longer be allowed,
you can use the -std=legacy flag, which will again reduce the sort of strictness of the compiler and allow it to bend the rules a bit more, like maybe the older compilers were famous for allowing. If you're talking about C or C++, there's again the same idea: try to find some flags that reduce strictness just to get your code compiling. We have this -fpermissive flag, and there's -Wpedantic; pedantic can warn you about lines that break code standards.
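Collected in one place, those strictness-relaxing flags look like this (app.f90 and app.cpp are placeholder files):

```shell
ftn -fallow-argument-mismatch -o app app.f90  # tolerate Fortran argument-type mismatches (gfortran)
ftn -std=legacy -o app app.f90                # accept older, legacy Fortran constructs
CC -fpermissive -o app app.cpp                # downgrade some nonconforming C++ errors to warnings
```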
Okay, so in the next section I have just a few quick tips about CMake and makefiles, and when I'm talking about makefiles here, I mean makefiles in the autotools sense, not the makefiles that CMake makes; I just wanted to make that distinction.
The other thing I want to point out here is that these tips are sort of high level, because when we start getting into build systems like CMake and makefiles, they're usually there for when you're compiling fairly complicated code, which usually means your makefiles and your CMake build systems are fairly complicated. So these really are just some tips; they might be kind of hit or miss for each person, but hopefully a few hits make it worthwhile.
In particular, the reason why this comes up here is that when we're using the Cray wrappers, a lot of times, deep down in that CMake build system or in the makefiles for other tools, the compiler has been hard-coded and won't accept the Cray compiler wrappers the way we think it should. The first thing to try, if something like that is happening, is one of these two techniques.
If you're doing the typical autotools method, where you configure and it makes a makefile, and then you make, and then you install, the way you do it is with a line like this, where you tell it up front: for CC you want the cc wrapper, for CXX that's the CC wrapper, and for FC you want the ftn wrapper. That will point it at the right Cray compiler wrapper for each of the compilers.
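The configure line being described is along these lines:

```shell
# Tell autotools up front to use the Cray wrappers for C, C++, and Fortran
CC=cc CXX=CC FC=ftn ./configure
make
make install
```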
That is, the C compiler, the C++ compiler, and the Fortran compiler. For CMake, if you're having that similar type of issue, typing this line on the same line as your cmake command, or before you call it, can help remedy the same problem. It's doing the same thing: it's telling your code ahead of time that you want to use the Cray wrapper for the C compiler, for C++, for Fortran, and so on.
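For CMake, the equivalent sketch is:

```shell
# Same idea for CMake: set the compilers when invoking it
CC=cc CXX=CC FC=ftn cmake ..

# or pass them explicitly as cache variables
cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
      -DCMAKE_Fortran_COMPILER=ftn ..
```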
So, as I mentioned before, makefiles and the CMake build system can be incredibly complex. What I have here is a really basic example that points out where I would start to look for things if I was having problems with my makefile, and some sort of easy adjustments I can make to incorporate the Cray wrappers, which would potentially allow me to solve some problems compiling my code. So in particular, this is sort of my example makefile.
This is really for an existing makefile, where you already have one that's set up and you just type make to compile your code. You're looking for a makefile that looks something like this. At the top of it you'll generally see several variables defined, and the ones you're going to be looking for, when we're talking about compiler wrappers specifically, are CC or CXX or those types of things. In this case, you see in this example, which I took directly off the web, they've
hard-coded, as I'm going to call it, the compiler to be the GCC compiler. So that means if I switch to the other programming environments, I'm not going to be switching away from the GCC compiler: whether I'm in the cray programming environment or the Nvidia programming environment, I'm always going to get sent back to the GCC compiler.
In the case of C that may not be a huge deal, but it illustrates the point well. So if I'm looking at a makefile and I see an issue like this, what I'm going to want to do is change it so that it points to the cc wrapper instead, because it wants a C compiler and I'm giving it a C compiler; that's the lowercase cc. And once I do that,
that's all I need to incorporate the Cray compiler wrappers into this build system. Now, when I type make, it's going to use the Cray compiler wrappers, and it's going to do all those optimizations and other stuff I told you about before, all hidden in the background as part of everything you want. So when you're looking through your makefiles, if you notice something like we talked about before and you can make this easy edit, I suggest you give it a try and see if that helps you out.
That was my tip for makefiles. My next tip is on CMake, and it really can be boiled down to one thing, and that is this application called ccmake. What I have on this slide is basically the walkthrough, in case you're not totally familiar with the CMake build process.
I have a typo here, because I need to make a directory; but I make my directory and I move into my directory, so I'm building this code from the directory above it, or the directory outside of it, and then I invoke cmake with a dot dot. This is kind of analogous to the configure step, but this is how you invoke CMake, and it will make its CMake files as part of the build process.
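The walkthrough on the slide is roughly:

```shell
mkdir build   # out-of-source build directory
cd build
cmake ..      # configure step: generates the build files from the parent directory
ccmake ..     # curses interface for browsing and editing the cached build parameters
```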
ccmake gives you a graphical user interface to kind of investigate the different parameters in your build. In particular, in this example I'm using it on that simple example, with the one CMakeLists and the one hello-world OpenMP .cc file I'm trying to compile, and you can see that CMake sort of automatically filled in a lot of these values.
So if I were to do something like that, or if maybe I was looking at a project that picked those up correctly, you would see something more like this. If I was looking here at the CXX compiler, you can see that this one has craype/2.7.19/bin/CC. That is the Cray wrapper for the C++ compiler, the Cray wrapper that we want.
So if I saw this, I'd say: okay, CMake is picking up the compiler the way I expect it to, and it's doing what I want it to.
The other thing to point out here is that this is page one of nine; there's a lot of information here that you can kind of look through for issues, depending on whether you're having trouble with your build system. So I like this tool and I find it helpful for finding issues. It doesn't necessarily solve them for you, but knowing where the problem is, is usually pretty helpful.
Okay, all right, so summarizing back to where we are: those things that I pointed out are what we're trying to address by giving you these tips, to try this line here, or to try this in your configure step. And for more details about this, as always, we have docs on these things that can be quite helpful,
docs that I still personally find helpful, so I feel fine sharing them with you. Okay, in the next section I'm going to show you a few quick examples; I think I'm going to just do one, based on how things are going, for compiling code on Perlmutter.
The main takeaway here, the thing I cannot say loudly enough, is: if you go from Cori to Perlmutter, you should probably recompile your code. The architectures are different enough that there are optimizations that will speed up your code.
Your code will run faster if you recompile it on Perlmutter, and it's very possible that if you take a binary directly from Cori to Perlmutter, it won't run at all. So: recompile your code on Perlmutter. The example I'm going to show you is just a simple MPI and OpenMP hello-world example; it's going to say hello from different threads and processes, out to the screen. This is pared down as much as possible, I think, for this example.
I'm doing this from the gnu programming environment, and like I said, this is a fairly straightforward example, but maybe it just gives us a taste of how to do this. You can see I have just the default module list; again, I'm going to compile with the compiler wrapper.
In this case the lowercase cc, and the thing to point out here is that on the compile line I told it to use OpenMP. I had to include that, because that's not included by default on Perlmutter; if you want OpenMP, you have to include it there. Then I'm setting the variables to do the threading, and then I run the code, and, yay, it works.
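Condensed, the compile-and-run session looks something like this (the file names and the task/thread counts are illustrative, not the exact values from the slides):

```shell
cc -fopenmp -o hello_hybrid hello_hybrid.c   # wrapper adds MPI and architecture flags
export OMP_NUM_THREADS=4                     # OpenMP threads per MPI task
srun -n 2 -c 8 --cpu-bind=cores ./hello_hybrid
```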
So the big takeaway from this simple example is that if you're compiling with the compiler wrappers, and you've been compiling with the compiler wrappers on Cori, you should find that these are very similar; it should be a very similar experience. The only main difference is that if you want OpenMP, you now have to include the flag; that was not the case before.
So that's the moral of the story here. I have another compiled example here; I'm just going to leave it, and you can watch it later.
It downloads a library, uses it, links it in, and shows you how to do that, and it will exist for eternity for you to look at when it's convenient for you. In the next section, I'm going to talk a little bit about understanding job parameters, and I do this because your job parameters are going to change when you come from Cori to Perlmutter, because the architecture is different, and if you understand what each of those parameters means, that's going to help you make intelligent choices about them.
So that's why I take this perspective. When you want to run a job, you have a job script; that's what we're looking at here. In particular, the things I'm going to be focusing on are the parts that are highlighted very lightly in different colors: this part here, this part here, these parts here, and these parts down here. It's going to be about how you understand these parts and how you make choices about them.
The key terms I'm going to focus on are node, MPI task, logical CPU, thread, physical core, and processor, and the advanced term is going to be the NUMA domain. The last time we did the Cori-to-Perlmutter training, I spent more time on NUMA domains; I'm not going to spend as much time on them this time. If you're interested in hearing me talk about them a little bit more, that video is still available online, so today I'm going to give you the shortened version.
If you look at our page for the Perlmutter system architecture, it's going to say two AMD EPYC Milan CPUs; I'm going to call that two AMD EPYC Milan processors, so when I say processors, I mean the same thing as this. And it's going to say 64 cores per CPU; I'm going to use the words 64 physical cores per processor. So I'm changing "CPU" to "processor" here, and the cores I'm going to call physical cores.
When we start talking about two hyperthreads per core, I'm going to call those two logical CPUs per physical core; so we've got the cores being physical cores, and the hyperthreads being logical CPUs. And for NUMA domains per socket, I'm going to say NUMA domains per processor. All right, so here we have the diagram of a CPU node: I've got one processor here, and I've got another processor here. This whole thing together, with the two processors, is one node, and inside each of these processors...
Well, we've got a nice picture here. The wider blue square is my node; inside my node I have two processors, which in this picture are the yellow parts. You can think of each one of the processors as looking kind of like this, and inside each processor you have physical cores; those are the little tiny things, here and here and here, a whole bunch of physical cores.
Each physical core is capable of processing two instruction threads, and because of that, I'm going to say that inside each physical core there are two logical CPUs. So: one physical core, two logical CPUs. And that's how the words I'm using translate to the architecture of the Perlmutter compute node, and I'm going to try to keep using those exact terms for the rest of the talk. All right, so.
Now, to understand a little better how the architecture works, I'm going to give you kind of an analogy, and the analogy is an office-building analogy, which is probably wholly unoriginal, but that's okay. So here I have my office building; you can think of it as full of nodes. Maybe each floor is a node, and on each floor there are maybe two office layouts like this that we can think of as processors, because this is sort of how the breakdown of our node works inside it.
We've got two office layouts, which are marked as two processors. Our office floor is made up of little tiny cubicles, where people do work, and each of those cubicles is one of the physical cores on our system. And, you know, I gave it away because I had the same picture last time, but I'm going to hold it to the end: this cubicle could represent only one specific piece of hardware we have here at NERSC, and I'll
let people try to guess in the Google Doc and figure out what it is, or maybe in the chat, which would probably be better, so as not to muck the doc up, I don't know. But I'll give away the answer by the end, or somebody else will. And then we could also think of our cubicle as set up like this,
and that would correspond to our physical cores inside of our processor, each one of these little boxes. But inside each one of these little boxes, whether it's shaped like this or shaped like this, you've got a little worker doing your instruction thread, and that is the logical CPU; that is the hardware thread. So these are your workers inside your cubicles, which are your physical cores; your physical cores go inside your office plan, which is your processor; and your processors live on your office floor, which would be your node. Okay.
So the reason I go through all of that is so that when you see this -N 2, something immediately pops into your mind; it should hopefully be an office floor full of cubicles. If I start talking about logical CPUs, those little workers inside each cubicle, that corresponds to this -c 16 parameter here. Those little people inside each cubicle should pop into your mind, and you should be thinking: oh, the hardware threads for the physical cores on my processor. And if I start...
Okay, now, that was the architecture; this is sort of how we break up our problem. When we do that, we're talking about the number of MPI tasks and the number of OpenMP threads. For this one... I'm sorry to have so many analogies at once, but let's try to keep them separate: that one was for hardware.
This one is really for you choosing how to break up the work of your simulation code. The way I'm going to ask you to think about that is: you can think of this truck as carrying a whole bunch of pallets, and all those pallets together correspond to your simulation code, the work that your simulation code needs to do. And you can break up all these big pieces of work, all these pallets, into MPI tasks.
If you do it that way, your MPI task corresponds to your pallet of different boxes. Now, if you want to further break that up by using OpenMP, you can think of each MPI task being further broken up into the individual pieces, with the threads. So that's where OpenMP comes in and that's where the threads come in: we've got the whole simulation code broken up into a number of MPI tasks, and each MPI task gets broken up into individual pieces of work. So that's the way to think about what's going on there.
Now, with that in your mind, when I look at this line here, export OMP_NUM_THREADS, that says that for each one of those sets of boxes, I'm going to break it up into eight pieces, so each pallet has eight boxes on it. So this is how I'm further dividing down the work for the job that I run.
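Put into a job-script sketch (all counts here are example values, not recommendations, and ./my_app is a placeholder):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH --time=00:10:00

export OMP_NUM_THREADS=8                    # boxes per pallet: threads per MPI task
srun -n 32 -c 16 --cpu-bind=cores ./my_app  # 32 pallets: the MPI tasks
```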
And I pointed out the NUMA domain, but I didn't say much about it. Rather than spend a lot of time introducing NUMA domains, I'm just going to give you the advice to follow to make sure the way you arrange your work respects the NUMA domains, so that it will run efficiently. And the reason I can do that is because we have a sort of general set of guidelines
A
We
give
you
that,
for
most
cases
is
pretty
successful
at
helping
your
code
run
efficiently,
and
so
this
is
what
those
guidelines
are
right.
So,
if
you're
looking
at
your
number
of
MPI
tasks,
that's
more
than
the
physical
course
right,
so
if
your
palettes,
the
boxes,
sorry
is
less
than
or
equal
to.
So
if
your
palette
of
boxes,
right
of
your
simulation
code
is
broken
up
to
is
less
than
the
number
of
physical
course
that
are
available
so
then
you're
going
to
want
to
use
the
CPU
underscore
bind
equals
course
option.
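As a one-line sketch of that option in use (the task count and program name are placeholders, not from the talk):

```shell
# Bind each MPI task to physical cores when MPI tasks <= physical cores per node.
srun -n 64 --cpu-bind=cores ./my_app
```

The spoken "CPU underscore bind" is the same option; Slurm has historically accepted both the --cpu_bind and --cpu-bind spellings.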
A
To
avoid
penalties
is
such
a
harsh
word,
but
it's
just
it's
not
as
optimal
as
if
you,
if
you
use
at
least
eight
you'll,
get
a
much
more
optimal
experience
if
you're
using
MP,
openmp
threads,
and
one
thing
to
also
check
is
that
Dash
C
should
be
greater
than
or
equal
to
the
valid
the
value
of
the
number
of
threads.
So
you
know,
if
you
have
two
threads,
you
want
Dash
C
to
be
two
or
more
right
and
then
to
make
sure
that
those
threads
kind
of
execute
close
together.
A
So
if
they're
working
on
the
same
thing,
because
threads
tend
to
work
on
similar
things,
it's
nice,
if
they
work
closely
and
not
like
one
person,
works
on
this
end
of
the
office
and
that
person
works
on
that
in
the
office
and
every
time
you
want
something.
You
have
to
walk
back
and
forth.
That
can
be
kind
of
annoying
and
slow
things
down.
A
But
if
you
use
these
settings
it'll
make
sure
that
when
you've
got
those
two
pieces
of
work
together,
it's
the
same
two
people
in
the
cubicle
working
together
and
that'll
make
sure
that
they
can
communicate
quickly.
So
if
you
follow
these
guidelines
to
set
those
parameters,
you
you
know,
we've
found
that
most
cases
you
will
get
a
good
experience.
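The settings themselves aren't read out loud here, so as an assumption: the standard OpenMP affinity variables below are NERSC's usual recommendation for keeping threads close to their task (a sketch, not a quote from the slide):

```shell
# Assumed slide settings: standard OpenMP thread-affinity controls.
export OMP_PROC_BIND=spread   # distribute threads over the task's places
export OMP_PLACES=threads     # one place per hardware thread
echo "$OMP_PROC_BIND $OMP_PLACES"   # prints: spread threads
```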
A
So
with
all
those
things
said,
we
now
know
this
part
of
the
job
script
right
that
came
from
those
guidelines.
They
gave
you-
and
you
know
this
part
of
the
job
script
that
also
came
from
those
guidelines.
I
just
gave
you
so
we've
kind
of
sort
of
covered,
the
things
that
relate
to
the
new
and
domain
and
things
that
are
associated
with
how
threads
are
processed
so
now.
A
Well,
I,
guess
in
the
next
later
on
we're
going
to
go
into
like
how
we
do
some
job
scripts.
But
this
slide
here
sort
of
incorporates
the
details
from
each
of
the
different
architectures
on
the
nodes
that
we
have
right.
So
you
know
Haswell
you're,
familiar
with
Corey
K
L
you're,
also
familiar
with
our
Focus
today
is
parameter.
Cpus
right
you've
got
128
physical
cores.
A
We've
got
two
logical,
CPUs
per
physical
core
right.
So
again,
these
are
the
cubicles
right.
This
is
how
many
people
are
in
each
cubicle.
This
is
how
many
physical,
how
many
logical
CPUs
per
node,
so
how
many
people
are
in
each
office
floor
right
so
for
our
office.
Our
building
floor
has
two
office
plans.
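Putting numbers on the analogy for a Perlmutter CPU node:

```shell
# 128 physical cores x 2 hardware threads each = logical CPUs per node.
physical_cores=128
threads_per_core=2
echo $(( physical_cores * threads_per_core ))   # prints 256
```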
A
How
many
people
work
in
those
that
office
floor
right,
and
this
is
how
many
Newman
domains
just
sort
of
which
areas
communicate
more
quickly
together,
and
this
is
the
formula
that
you
can
use
for
each
one
of
these
different
architectures
to
calculate
the
value
of
C
right
and
so
we'll
use
that
in
the
next
couple
slides.
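The formula itself isn't read out loud, but it has to reproduce the values used later in the talk (2 for 128 tasks per node, 8 for 32 tasks per node), so a consistent sketch, assuming the task count divides the core count evenly, is:

```shell
# -c value = 2 * (physical cores per node / MPI tasks per node);
# the factor 2 is the two hardware threads per physical core.
cpus_per_task() {
  echo $(( 2 * ($1 / $2) ))   # $1 = physical cores/node, $2 = tasks/node
}
cpus_per_task 128 128   # Perlmutter CPU, 128 tasks per node: prints 2
cpus_per_task 128 32    # Perlmutter CPU, 32 tasks per node: prints 8
cpus_per_task 32 32     # Cori Haswell, 32 tasks per node: prints 2
```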
A
So
what
I'm
going
to
do
here
in
these
examples
is
I'm
going
to
look
at
a
job
script
from
Corey,
Haswell
and
I'm,
going
to
talk
about
how
to
change
it
to
a
job
script
or
the
program
matter.
A
Cpu
node,
so
in
particular,
in
this
example,
I've
decided
that
I
don't
want
to
use
openmp,
threading
and
so
I've
set
this
variable
at
one,
which
is
just
kind
of
a
best
practice
and
I've
decided
that
I
want
to
split
up
all
the
work
of
my
simulation
to
1280
MPI
tats
now
I
mean
that's.
For
me.
How
this
number
is
chosen
is
kind
of
depends,
a
lot
really
on
your
application
tuning
these
parameters
to
find
the
optimal
result.
A
Sorry, for each MPI task I want two of my workers to work on it, two of those workers inside a cubicle working on it, and that's where that number comes from. So what I'm doing in this example is keeping that -c constant: I also want two workers to work on each MPI task over here. And the thinking that I've gone through to come up with these numbers goes like this.
A
I've
taken
my
total
number
of
MPI
processes
and
I've
divided
it
by
the
number
of
nodes
right,
so
I
had
32,
MPI,
sorry,
MBI
tasks,
32
MBI
tasks
on
each
node
right,
then
I
use
that
formula
that
you
saw
in
the
previous
slide
to
determine
the
value
of
C
to
be
2.
right.
So
the
same
thinking
kind
of
happens
over
here
for
the
parameter
CPU
right
so
now,
I've
done
I
have
1280
MBI
tasks,
I
divide
that
by
the
10
nodes
here,
right
and
well.
I
should
say
that
this
is.
A
This
is
how
I
came
to
the
number
10
here
is
I
thought:
okay,
if
I
take
1280
and
divide
that
by
10
I
get
128.
and
I
know
that
if
I
put
in
128
into
that
formula
from
the
last
page,
I'll
get
two
right.
So
I
know
that
that's
the
right
number
to
put
in
this
formula,
which
means
that
this
number
should
be
10,
which
means
that
the
number
of
nodes
that
I
want
should
be
10..
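Assembled as a job script, the Perlmutter side of this example would look roughly like this (a sketch; the walltime and program name are placeholders, not from the talk):

```shell
#!/bin/bash
#SBATCH --constraint=cpu
#SBATCH --nodes=10             # 1280 MPI tasks / 128 tasks per node
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:30:00        # placeholder walltime

export OMP_NUM_THREADS=1       # no OpenMP threading in this example

# -c 2 comes from 2 * (128 physical cores / 128 tasks per node)
srun -n 1280 -c 2 --cpu-bind=cores ./my_app   # ./my_app is a placeholder
```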
A
In
this
example,
now
I'm
keeping
the
number
of
nodes
constant
and
I'm
changing
the
number
of
logical
CPUs
for
each
MPI
task.
So
it's
a
similar
thing
right
same
con
same
computation
here
over
here,
I'm
splitting
up
my
MPI
task
across
the
40
nodes.
So
each
node
gets
32
processes,
but
because
there
are
more
physical
cores
and
logical
cores
available
on
a
promoter,
CPU
node
I
can
afford
to
associate
eight
logical
CPUs
to
each
MPI
task
instead
of
just
two.
A
So
this
is
sort
of
an
example
where
we
get
to
play
this
game.
You've
got
32
nodes.
Sorry,
we've
got
our
our
work.
We've
decided
to
split
it
up
into
512
MPI
tasks,
we're
going
to
split
it
across
32
nodes.
What
do
we
want
the
value
of
C
to
be
if,
in
this
configuration?
A
Well
here
is
my
hint
right
I'm
splitting
my
512
MPI
tasks
across
32
nodes,
so
that
gives
me
16.
I
put
that
in
that
formula
as
before,
right
and
I,
because
I'm
doing
openmp
I
have
to
do
that
check
for
my
rules
and
I
find
that
check,
and
that
told
me
it's
good.
This
also
sort
of
gives
away
the
answer
right,
because
I'm
making
that
check
here
and
told
me
that
the
answer
to
this
question
is
16.
and
I'm.
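Checking that arithmetic explicitly (the OpenMP thread count in the final check is an assumed illustration; the talk doesn't state it):

```shell
ntasks=512
nodes=32
tasks_per_node=$(( ntasks / nodes ))    # 512 / 32 = 16
c=$(( 2 * (128 / tasks_per_node) ))     # 2 * 8 = 16
echo "$c"                               # prints 16

# The OpenMP rule: -c must be >= OMP_NUM_THREADS (8 assumed here).
omp_threads=8
[ "$c" -ge "$omp_threads" ] && echo "check passed"
```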
A
I'm basing this on the assumption that I want to make full use of all the computational power on the node. If you want, you know, the job script generator is a sort of automatic tool.
A
It's a good way to learn, and it's also a good way to get a job script out and get you something to try that seems reasonable. You can find it in two locations now; both of these work, so I'm giving you both here. And with that, I'm going to sort of summarize. The key suggestions from my talk are: use module spider rather than module avail, it will show you more things.
A
Recompile
your
query,
codes
on
pearlmeter
we've
got
the
new
programming
environment.
You
have
the
create
program,
environment,
the
Nvidia
program,
ad
they're
worth
trying.
You
know
obviously
start
with
the
default,
then
move
to
the
others,
use
the
compiler
wrappers
because
they
do
so
much.
A
You
know
kind
of
it's
kind
of
unseen,
but
there's
a
lot
of
optimizations
and
and
other
things
being
pulled
in
and
it
allows
you
to
more
easily
try
the
different
programming
environments
and
you
know,
look
back
at
obviously
look
back
at
your
job
scripts
and
and
try
to
you
know,
recalculate
your
JavaScript
parameters
for
for
Optimal
Performance
on
Pro
matter,
and
with
that
there's
only
one
more
note
here.
A
Over to you, Helen; is this usually the slide that you talk on?
B
Okay, yeah. So during the hands-on session later, we have prepared this part of the exercises for the CPU. This is the GitHub repo there, and we have a README.first that basically encourages you to work in this order. Then you have a README file for each example: hello world, serial and MPI code; then matrix multiplication or Jacobi, which is a C example or a Fortran example, to do hybrid MPI and OpenMP; and we also have an xthi affinity example. You can compare Cori
B
Compare
on
the
CPU
side
on
parameter,
CPU
find
out
all
these
flags
that
Eric
talked
about
and
understand
more
with
the
the
the
the
what
what
chorus
with
High
hyper
threads
are
that
your
opening,
pnm
MPI
openp
threads
are
binding
on
there's,
also
a
GSL
test.
A
Yeah
and
with
that
I
will
thank
you
for
listening,
and
you
know
you
can
put
questions
in
the
Google
Doc
or
you
know
later.
If
you
ever
need
help,
you
know
submit
a
ticket
be
happy
to
help.
You.