National Energy Research Scientific Computing Center (NERSC) Migrating from Cori to Perlmutter Training, March 10, 2023, 10 Mar 2023

Previous Meeting

⏯

youtube image

►

From YouTube: Migrating from Cori to Perlmutter: GPU Codes

Description

Migrating from Cori to Perlmutter: GPU Codes
Presenters: Muaaz Awan, Stephen Leak, Helen He, User Engagement Group
Training: Migrating from Cori to Perlmutter, March 10 2023

A

Okay, so my name is Mars and I'll be uh going over uh some some of the tips and tricks that you may need, if you're trying to run your uh trying to Port your coach to the GPU nodes in particular and before I begin. I would like to thank Steve Van Helen for for the contribution to these slides and the materials, so uh I think most of the things that you need to get started on print matter have been covered by Eric very nicely, and this.

A

This presentation is mostly uh describing the particulars of the GPU nodes, uh for example the programming environment and the architecture of the nodes. And later we have uh a few Hands-On exercises. But before you get to them uh yourself, I'll do a walkthrough of those uh through using the uh the time that I have for this presentation and I'll try to convey some of the concepts that we are trying to explain through those Hands-On exercises.

A

So, as was described by Jack in the morning, the the GPU nodes on parameter. uh We have about 1792, GPU notes on parameter and all of them have same architecture except a small change out of these 17 192 notes.

A

1536 nodes have uh a100 gpus uh the their 40 GB variant, which is that the size of the HPM is 40 GBS on each GPU, while 256 of these nodes have the atgb variant.

A

So each of these GPU nodes also contains a host processor, a CPU which is the AMD Milan, zip view it contains 64 cores and, as was described by Eric. These are the hardware cores and each core contains two logical, CPUs or hyper threads. So you have a total of 128 logical, CPUs on the front motor GPU node and each GPU node contains four Nvidia a100 gpus. The the slight difference between the gpus as described above, is that 256 nodes will have the gpus with 80 gb HPM each.

A

So all the gpus on a GPU node are connected via NV link connections. So it's a it's like an all to all connection and the CPUs and the gpus can communicate via the PCI Gen4 bus. Each node also contains four slingshot 11 nics, as described over here, and these are connected to the CPUs via a PCI Gen4 connection as well. I think this was also a question uh in the in the first presentation: uh what is the 256 GB ddr4?

A

So this is the ram that you have on on the nodes, and this is separate from the the gpu's memory, so each GPU has 40 GBS or 80 GBS of memory. uh That is the high bandwidth memory, that's available on the GPU that you can utilize and then on the host side, we have 256 GPS of ddr4 memory available as well foreign with this. Let's move on to the the programming environment. uh So when you log into parameter, everything is by default set for GPU nodes.

A

So you won't need to make any change, and you can check that by listing the modules that you have loaded, you'll see that a module GPU is loaded and what this module does is it. It makes sure that all the required environment variables and the compiler wrappers are set up for GPU builds you'll, also notice that you have a the quota, toolkit module and the creepy Excel nvidiait module loaded. So these are things that are required if you want to use the GPU specific features of the of the node.

A

The default programming environment is new, so if you log into perimeter and do a module list, this is what you will see. You'll see that the GPU module and the GPU specific modules are loaded along with the new programming environment. But if you want to use the Nvidia programming environment, you will have to change to that by default. It's big node and let's talk a bit about the the compiler wrappers, so the compiler wrappers are something of uh you know. It makes things really easy for you.

A

Let's say that you want to build a code using the gnu compilers and you want you would want to use the G Plus compiler if you're building a c plus code, but when you're building a GPU code, there are a lot of libraries that you want to link and, and they could be all over the place and we do not want you to go through the process of you know finding the path of each of those and making sure that it is being LinkedIn.

A

uh What you want to do is you want to use the compiler wrappers so, regardless of the type of programming environment that is loaded, uh your compiler wrapper will make the call to the to the right, uh compiler and Link all the required libraries. For example, if we have the new programming environment loaded, we are basically trying to work with the gnu compilers and if you use the capital, CC compiler wrapper.

A

That's used for the C plus applications, and you will see that underneath it it's basically using the G plus plus compiler, and if you use the small CC wrapper, it would be GCC compiler. Underneath now. Let's say that you want to use the Nvidia compilers, so you swap or you change the the programming environment to Nvidia by doing module, alert, programming, environment, Nvidia, and then you can check the version of the compiler by using the compiler wrapper, and you would see that the NVC plus plus and the NVC compiler are now being used.

A

So the compiler wrappers really make things easy for you, especially when you're you're working on GP notes, because with the GPU you have to link in the the different runtimes here, foreign.

A

Documentation and a kind of a similar image was shown by Jack in the morning. This is the uh all the programming models that we have available and the programming environments that make them available. For example, Nvidia. uh If you do reduce the programming environment in media, you will be able to build your program in Cuda, open, SSC, openmp, uh Cocos and Roger as well.

A

uh While there is another experimental program environment that is the nurse programming environment or the llvm, uh also in in the in the pipes, uh I think you will have access to it soon. The experimental version can still be accessed. You can check our documentations for that, so this provides I think the the widest uh coverage it even provides coverage for the hip and the sickle programming models. These are the programming models that are used by the AMD and Intel gpus.

A

So if you code in or Circle, you will be able to run on AMD and Intel devices uh respectively, while both of these can also be run on the Nvidia gpus that are available at nurse.

A

uh So once you have decided what programming model that you want to go with, uh then we have the recommendation of the programming environment that you want to use I'm guessing a lot of the applications are already there, where you have already decided the type of programming model that you want to use. So uh everything uh is all uh like. Most of the things are supported by the Nvidia programming environment, uh while Cuda and cocus. uh We can.

A

We also recommend the new programming environment for them, uh so for Cuda it you typically want to go with the Nvidia and new, uh with the Cocos nvdn glue would also work uh with the for the open ACC and the standard uh C plus Library parallelism. You would want to go with the Nvidia compilers foreign.

A

With this. uh Let's move on to the Hands-On exercises. There are a few Concepts that are covered there, uh that I want to walk through before you start doing the Hands-On uh exercises, so the the record on this link contains two directories: uh a GPU, a directory for the GPU examples and another for the CPU examples here, I'll just go through the GPU examples. So once you move to the GPU directory you'll see, there is a readme file that is basically sort of a lab manual that you can walk through.

A

It contains instructions on how to build and run and what would be the expected output. There are some optional exercises that you can try out as well um yeah, so we try to touch uh everything. You know almost everything, that's uh that's very basic for the GP nodes. For example, you try to touch the tree programming models that you can use to run on on the GPU, we'll try to build them using different programming environments or compilers in particular, We Touch, Cuda, open, ACC and openmp. These examples are pretty simple.

A

uh Your codes may be more complicated. If that is the case, you can always reach out to us, and we can help you with the more more complicated things.

A

uh Then we have the uh some mixed examples where you have the GPU kernels being called by the uh from the MPI ranks and in some cases you want to use separate compilation, because not all the compilers are able to build everything, and then we have large.

A

Two examples are for the Cuda wear and the GPU Affinity Cuda, where MPI gives you the uh gives you the power of communicating between two gpus directly with the data from one GPU buffer can be transmitted directly to a remote GPU buffer and then just like the CPU Affinity, that was that was covered by Eric. There is a GPU Affinity, because there is a how you can bind your ranks to gpus to get the Optimal Performance.

A

Each of the exercises directories will contain a make file: a batch script, the S patch script and some source files. The make file contains these steps to build the example and the dispatch file contains instructions to run that code. You can use the DS patch file, the dispatch script directly or you can just you know, get get like in an interactive node and just run the executable.

A

uh The execution Rhino line over there, uh so the the expat script for GPU nodes would look very similar to what you you have been using on query or what you will be using on on the parameter, CPU notes. There are a few uh changes that I'll point out, so one major thing would be the change of the the number of CPUs per task. So, on the CPU news nodes, you have 256 logical CPUs that is 128 uh Hardware cores on the GPU nodes. We have have that number.

A

We have 64 Hardware cores and 128 uh logical CPUs. Now for slurm, one CPU is a one logical CPU. That is one hyper thread. So if you have, let's say in this example, eight ranks in total. You have two nodes, so you have four ranks per node. That means uh you'll need to assign uh 32 CPUs per rank out of the 128 in total, and let's say that, instead of for you had 64 ranks per node, then you would you would set C to C equals to 2.

A

Then you would because you want to have one core or two logical, CPUs map to one ampere rank. The other two things that you want to focus on when running on the GPU notes is the gpus per per task. That is the number of gpus that you have you want. uh You want to have per task available and the the constraint should be set to GPU, because otherwise you will not be requesting a GPU node in you know, specifically and for for the scope of this training.

A

If you want to use the reservation that we have for you set this reservation flag to PM underscore GPU underscore March 10th. So this will, you know, get you through the queue very quickly, because we have some dedicated notes available for this training.

A

uh When you're running or building or running on on the GPU, there can be some confusion. For example, if you have an openmp code and you're, not sure if it's running on on GPU or on the CPU, there are some uh debug environment variables that can be very helpful.

A

For example, if you're working with the new compilers setting, this variable will tell you uh when a kernel is launched or when a data transfer takes a place from CPU to GPU, or vice versa, and similarly working with the Nvidia compilers, you have more find control over what you want to debug and what events you want to be alerted about so uh make use of these variables. These can be very helpful because sometimes you don't want to use the profiler directly, because there is a larger overhead.

A

You just want to run your executable, so this basically prints out all the information that you need onto the console foreign with this. Let's move to the exercise, one so exercise one would just be a simple, a simple Cuda kernel and we will demonstrate it using two different types of files. So there is a cool file. There's a CPP file: if you have uh the Cuda API calls within a DOT CU file, that's a trademark extension for the Cuda files and that will be detected by all the Nvidia compilers.

A

So if you have the nvcc, which is a Kuda compiler, it will obviously detect it. But even if you have the NVC plus compiler, it will also detect without any specific flags being passed, but if you're using a different compiler. uh Let's say sorry. If so, if you are using the different extension, that is the dot CPP and you want your Nvidia compilers to pick up that it's a Cuda file, then you would want to pass a specific flag for that if you're using the NVC plus plus, then that would be Dash Cuda flag.

A

uh In the example in the exercise 2, we show the separate compilation. uh The Cuda code in particular, can only be built with the nvcc uh compilers I mean it's. It's specifically recommended, but I think the llvm can also do that. But it's recommended that you build it with the nvcc. uh It can get you the best performance and let's say you want to build your main application with a different compiler. Let's signal compilers, then you would want to use the separate compilation that is you first build your kernels.

A

You move the kernels to a separate file. You build it using the mvcc, generate an optic file and then link that object file using the compiler wrappers of new or whatever, the compare and compiler that you want to use in a later uh stage.

A

In example, three we show how you can use MPI along with Cuda, and this one is a much simpler example where everything is located within a DOT CU file. You can simply use one of the Nvidia compilers I, think particular NVC, plus plus through the compiler wrapper, and it would just detect what language it is because it's a DOT CU file, so it will be much simpler, but the interesting thing is that the compiler wrapper will be able to link the required libraries for the MPI as well uh the.

A

But the more realistic case would be that you have your Cuda kernels in a separate file. You have your NPI or the main host code on on a separate file, and you make calls to your quota kernels uh from that host file and that can be done using that can be built using a uh using the the separate compilation again and you can use your choice of host compiler here, but make sure that you use the uh the compiler wrappers, because otherwise the required libraries will not be linked.

A

uh So before we move on to the next examples, which include the uh something about uh CPU and the GPU Affinity. Let's have a quick recap of the how to set the number of CPUs per task: uh Corey Hazel and Corey K L you're, aware of on the parameter CPU, as was described before, we have 128 physical cores, while on the GPU notes we have half of that, and if we want to set the the dash C flat. That is the number of CPUs per task.

A

This is a simple uh formula that you can use, so you divide 64, which is the number of Hardware codes that you have on the Node, the CPU Hardware course and divided by the task uh number of NPI tasks per node and slower that you know round off to you, know, wrap around it down and then multiply it with the two. For example, if let's say, if we had 64 tasks per node, the thing inside bracket would be one. You multiply that with two, and that gives you two.

A

So that is the two CPUs per task that you're setting it uh with this. So this is an example of so yeah. You already saw that in order to make sure that we are getting the best out of our nodes, we want to set the CPU by inflect course that will bind your tasks to course, but there is another thing on the GPU node that is the GPU binding.

A

So if you do not set uh GPU binding, then all your ranks will have access to all the gpus, but gpus are are mapped to particular Newman nodes, so numenode is uh is like a region, a physical region in inside your node. It was very nicely described by Eric, so I'll not go into the details, but let's say that so on your GPU uh on your GP nodes, you have four Newman nodes and each pneuma node is uh tied to a particular GPU.

A

Now, let's say that your you have a rank that is bind to pneuma node 0 and it tries to access a GPU that is located on Newman node 2. Then they are physically far away. So there will be a penalty for that So. To avoid that sort of thing we recommend GPU binding so that your gpus, uh so that your gpus are, as you know, in space, they are set closer to your MPI tasks, and that is what we are going to explore in this example. uh So in this example, we have two batch scripts.

A

There is one is the regular one and the other is the GPU binding one. So first the the regular one will have have no GPU binding and you will you can see that your execution line would look something like this, where we are just finding the MPI ranks to cores and we are not setting the GPU binding here and it will print out. Each rank will print out all the gpus that are visible to it and the GPU that is assigned to it. In this case we are assigning gpus in a round robin fashion.

A

So let's have a look at this highlighted example of rank 1, which is bound to core 16 now core 16.. If you look at this node map, you can see that the core 16 is located in pneuma node, one now Numa node one has the the GPU with PCI address 82 tied to it. So ideally we would want the rank on node on on core 16 to have access to this GPU.

A

But if you look here, the the core 16 rank is has been mapped to the GPU, with PC address 41, which is actually located in New pneuma node 0..

A

uh So this is not ideal and you can also see that it has. It can still see other gpus as well, and any of those could have been assigned to it. It totally depends on how you map a program programmatically, but if you use the GPU bind Flags. This is what the output would look like if uh in over here in this particular I'm setting the GPU bind flag to close this.

A

So it will map each rack to the GPU that is physically closest to it, and you can see that it uh in this run, the core 16 rank, widgets are named. Rank 2 now can only see one GPU and it is the PC address 82 GPU, and you can see that it's located in pneuma node one, uh and that is the same pneumonode where the core 16 is located.

A

So just setting this simple slag will uh you know, make it through some performance Improvement, because now your your gpus are physically closer to the to the ranks. uh Try this out. There are different settings that you can use. You can check more on the sked MD website and GPU bin equals to close assist. One option, but you you can multiple options. You can even do the custom mapping of gpus ranks.

A

uh The other uh interesting feature is uh of Cuda, aware MPI, so with the unified virtual addressing uh technology which basically allows uh the uh the GPU device memory to appear, as part of uh you know, same address space. As of the CPU memory, uh like has been shown on the figure on the right, it allows us to make a direct transfer of messages uh from one GPU to the other.

A

So, basically, if you want to send some data from the buffer of GPU from one GPU to a remote GPU that can be done directly and it will basically bypass all the communication, like you know, going through the CPU memory, and that is the typical route that could have your MP allows you. You know much faster, a communication. This way.

A

uh Cuda the example. Six will demonstrate how to do that. uh Basically, you just uh use the GPU module that will already be loaded in your environment.

A

uh You can just build your example uh as you would, and then you can verify uh that it's actually, you know linking this particular GTL to the library by checking over the the list of library that have been linked in this basically indicates that your code is going to make use of the Cuda where MPI capabilities the example that we have uh will uh we have two ranks uh two remote ranks.

A

uh One will send some data to a remote GPU on the other Rank and the other rank will read and print it to the screen. So it basically shows uh you know you can go through the code and see how that's done programmatically and that it's actually being done uh in the example.

A

uh The the last example has two parts, so here we describe with a simple example: we describe how you can use and build for open, ACC and openmp. These are other programming models. These are more portable in nature than the Cuda uh programming model, which is a kind of very specific to Nvidia. So if you plan on running on different architectures, uh these are some programming models that you can look into. Openmp has I think the the widest of widest support foreign.

A

We should move on to the the Hands-On uh section uh we will be around and if you have any questions, please uh reach out to us. uh Thank you very much.