From YouTube: 04 Migrating from Cori to Perlmutter GPU Codes
Description
Part of the Migrating from Cori to Perlmutter Training, December 1, 2022.
Please see https://www.nersc.gov/users/training/events/migrating-from-cori-to-perlmutter-training-dec2022/ for the training day agenda and presentation slides.
Hello everyone, my name is Moz, and I, along with Steve and Helen, will be presenting about the GPU nodes that we have on the Perlmutter system. First, as an outline or overview of this presentation: we'll have a quick look at the GPU nodes and their hardware configuration, and then we'll move on to the types of programming environments that are available on them.
And finally, we have some hands-on exercises, which are set up as self-guided, but before you start working on them on your own I'll do a quick walkthrough of them. There are several concepts explained there that are new to these nodes and were not available on the Cori system or on the Perlmutter CPU nodes, so I'll try to walk through them with some examples. Once we are done with that, you can try them on your own, because if you do, you learn better. So, the GPU nodes.
We have about 1,500 GPU nodes on the Perlmutter system, and each of these nodes contains one AMD Milan CPU. This is the very same CPU that is also present on the Perlmutter CPU nodes.
The difference is that on the CPU nodes we have two of these, while on the GPU nodes we have one. Each Milan CPU has 64 cores, where each core has two hardware threads, so the scheduling system will see a total of 128 compute elements (CPUs) on each GPU node. Along with that, the distinguishing factor is the GPUs.
We have four NVIDIA A100 GPUs, and each of these GPUs has 40 GB of HBM (high-bandwidth memory) and is capable of performing up to 9.7 teraflops of double-precision floating-point operations. Each pair of these GPUs on the node is connected with an NVLink connection, while the CPU and GPUs communicate through a PCIe Gen 4 bus. So it's a highly performant node. With that, we move on to the programming environment.
The programming environment is very similar to what is available on the CPU nodes, except that some specific modules are available, or are loaded, on the GPU nodes when you have to build your code for the GPUs. Apart from that, everything is the same as Eric explained, so all the module tricks discussed earlier still work when you are targeting the GPU nodes. In fact, the code will typically be built on a login node, so everything remains the same there.
I'll try to talk about the differences that you must be aware of when you're building for the GPU nodes. Now, if you log in to Perlmutter in your terminal and do a module list, something like this will show up, and you can see that by default we have this gpu module (number 18 in the list) loaded. When this is loaded, the environment has been configured for GPU codes.
If you are looking to build CPU code, then you will have to unload this, but by default this module will always be there, and we are assuming that you're building for the GPUs. What this module does is load some additional modules, for example the cudatoolkit and craype-accel-nvidia80 modules. These modules are required for building for the GPU, and by default you can see that the environment is the GNU environment.
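As a minimal sketch of what was just described (module names as given in the talk; the exact way to switch to a CPU-only build may differ, so check module avail on the system):

```bash
module list          # 'gpu', 'cudatoolkit', and 'craype-accel-nvidia80' appear by default
module unload gpu    # only if you need a CPU-only build environment
module load gpu      # restore the GPU build environment
```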
There is a feature known as CUDA-aware MPI, and if you want to utilize it (we'll get into the details of that later), you want to make sure that you have the gpu module loaded. Once you have everything set up, it's recommended that you use the compiler wrappers to build. Eric talked about the compiler wrappers in detail, and I'll also give a few examples of that now.
By default, the compilers that are loaded are the GNU-based ones, and you can access whichever programming environment or compilers are loaded through the compiler wrappers. For example, if I have the GNU compilers loaded, I can check with the compiler wrapper CC (the capital CC) by running it with --version; you'll see that the underlying compiler is g++. Similarly, for the C language you can run the lowercase cc, and you'll see that GCC is the underlying compiler.
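A quick sketch of those checks:

```bash
CC --version    # with the default PrgEnv-gnu, reports g++
cc --version    # reports gcc
```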
Let's say that you want to use a different compiler, for example PrgEnv-nvidia, because you want to use the NVIDIA compiler. Then you load the PrgEnv-nvidia module, and you can see that some changes happen in the environment; if you run CC --version again, you'll see nvc++ showing up now.
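For example, something along these lines (loading the new programming environment swaps out the default one):

```bash
module load PrgEnv-nvidia   # replaces the default PrgEnv-gnu
CC --version                # now reports nvc++
```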
This is the NVIDIA compiler. Do not use nvc++ directly; always use the compiler wrappers. We'll see why this is important when we go into the hands-on exercises.
On Perlmutter, we support several GPU programming models; we have support for almost everything available out there. Certain programming environments are better suited for certain programming models, so this is a list of our recommendations. If you're working with a CUDA code, it's recommended that you use PrgEnv-nvidia, naturally, because CUDA is NVIDIA's proprietary model. But let's say that you have an application that uses the GNU compilers; you can still use those, but then you will have to do a separate compilation.
You'll have to make sure that your CUDA code is built using the CUDA toolkit, that is, the nvcc compiler. We have a hands-on exercise about this as well, and I'll point it out when we get to that. Kokkos is presented as a C++ library, so anything that supports C++ and has the backend support for the GPUs will work; you will be able to use that programming environment for it.
Then we have OpenMP offload. If you are looking for portability across different types of GPUs, this is one of the most recommended programming models: if you have a code in OpenMP offload, it will work on NVIDIA, AMD, or even Intel GPUs. On Perlmutter, we support it using PrgEnv-nvidia, and there are also options in PrgEnv-cray that allow you to build code that uses OpenMP offload; the same applies to the OpenACC model on Perlmutter.
To summarize the last three or four slides: if you have a source code, it does not matter what it contains, whether GPU directives, MPI, or plain CPU code, we recommend that you build it using the compiler wrappers instead of the underlying compilers. For example, if you have a C++ code, you use the capital CC compiler wrapper; if you have a C code, you use cc; and if you have a Fortran code, you use the ftn wrapper.
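As a sketch (the source file names here are placeholders):

```bash
CC  main.cpp -o app    # C++ source goes through the CC wrapper
cc  main.c   -o app    # C source goes through cc
ftn main.f90 -o app    # Fortran source goes through ftn
```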
This is also regardless of the programming environment that you are using. It doesn't matter whether you have PrgEnv-gnu, PrgEnv-cray, or PrgEnv-nvidia: you always stick with the compiler wrappers, because they will take care of the compiler that is to be used and the libraries that are to be linked in. Now, the exception would be a CUDA kernel.
If you have a CUDA kernel inside a .cu file, then the common method is to use nvcc. But on Perlmutter we have PrgEnv-nvidia, which contains the nvc++ compilers, and you can even build the CUDA code with them directly; you just have to pass the appropriate flag. We also have a hands-on example that explains this nicely.
Now, with this, we move on to the hands-on exercises, and that is where the bulk of the information will come from. You may already be aware of this link: there is a GPU directory in this repo, and if you cd into it you'll see a long README. That README is basically a kind of lab manual for this training.
For these exercises, you can read through it; I would suggest that you open it in a separate window and open all the code in your terminal, then read through the README and try to follow the steps in the exercises. It's not an assignment; you don't have to do anything on your own. You will be perfectly fine if you just follow the steps, and I would highly suggest that you open up the Makefile and look into it.
So, what's covered in these exercises: we start off with a simple CUDA code and make it more complicated as we move forward. We add MPI into the mix and try to build the MPI-plus-CUDA code with different types of programming environments. Then we talk a bit about the CUDA-aware MPI example, where you are able to communicate between two GPUs across nodes directly, and then about GPU affinity, just like Eric explained CPU affinity.
As I mentioned, there are two important files in each exercise: a Makefile and a batch.sh file. The batch.sh file will mostly be very similar across exercises, except when we talk about affinity; the Makefile is the most important one. Typically in a training you would think that a source code file is more important, but since here we are trying to learn how to use the programming environments and how to build your code, the Makefile takes precedence over the code.
The source code will be almost the same in all the examples. The batch file is basically used for launching a job in an efficient manner: if you get a dedicated node and try to run on it interactively, you will basically be wasting time, because you'll spend a lot of it reading the files and building. But if you run through the batch system, it will be much easier. So there are different options here; these may not reflect exactly what's written in the batch files
that are included in the examples, but the overall concept and the terms are pretty much the same. The -q option specifies the QOS (quality of service), or the queue, that you want your job to go into. The capital -N is the number of nodes that you're requesting, -t is the time in minutes (for example, this is five minutes), and -n is the number of MPI tasks.
That is the total number of tasks, across all nodes, that will be launched. And -c is the number of CPUs that you're requesting per task. Now, be mindful that for Slurm a "CPU" is a hardware thread. For example, at the start I mentioned that each Milan CPU has 64 cores and each core has two hardware threads, so in total we have 128 compute elements, and Slurm sees each compute element as a CPU.
So -c is the total number of CPU elements that you're requesting per task. Then we have the number of tasks per node, which is self-explanatory, and then the number of GPUs per task. So basically here you have four tasks per node and one GPU per task.
So that's a total of four GPUs that you're requesting. For the purpose of this exercise, you will be using ntrain2 as your allocation account, and your reservation would be pm_gpu_1. Also, it's important to use the constraint gpu, because if you don't set that, you're not requesting a GPU node; for the CPU nodes you'd replace this with cpu. So for all these examples, make sure that you're requesting a GPU node.
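Putting those options together, here is a hedged sketch of such a batch script (the QOS name, -c value, and executable name are illustrative placeholders; the scripts in the repo are authoritative):

```bash
#!/bin/bash
#SBATCH -q regular              # QOS / queue (placeholder)
#SBATCH -N 1                    # number of nodes
#SBATCH -t 5                    # wall time in minutes
#SBATCH -n 4                    # total number of MPI tasks
#SBATCH -c 32                   # logical CPUs (hardware threads) per task
#SBATCH --ntasks-per-node=4     # tasks per node
#SBATCH --gpus-per-task=1       # one GPU per task, four GPUs in total
#SBATCH -C gpu                  # request GPU nodes (use -C cpu for CPU nodes)
#SBATCH -A ntrain2              # training allocation account
#SBATCH --reservation=pm_gpu_1  # training reservation, as given in the talk

srun ./app                      # './app' is a placeholder executable
```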
These are some useful runtime environment variables. A lot of the time when you're working with GPU code, especially OpenMP offload type code, because OpenMP also runs on the CPU, it's important to know whether the example you ran actually used the GPUs. So if you set these variables to the given values, you will know what's actually happening.
For example, if you set this one to 2, it will tell you about the data transfers that are happening, that is, when data transfer happens between the CPU and the GPU. There are multiple options; try to explore them, and this will also help you understand whether your code ran on the GPU or is running on the CPU.
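The slide lists the exact variables; with the NVIDIA compilers, one such knob is NVCOMPILER_ACC_NOTIFY, whose bitmask value 2 reports host-device data transfers (value 1 reports kernel launches):

```bash
export NVCOMPILER_ACC_NOTIFY=2   # print a line for each CPU<->GPU data transfer
srun ./app                       # './app' is a placeholder executable
```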
In exercise one, which is the simplest of all, we have a simple CUDA kernel, and it has been placed in two different files: one file is named .cu, the other is named .cpp. We first try to build the .cu file, which contains a CUDA kernel that runs on the GPU, with the nvcc compiler, a dedicated compiler for the CUDA language, and then we do the same exercise using the CC wrapper.
Now, it will rarely happen that you have a code that only contains CUDA and sits all in the same file. Typically, your application will be large and distributed across multiple files, and as a good practice people try to keep all their GPU code in a separate file, which actually makes things easier. So let's say that you have an application that makes extensive use of CUDA and all the CUDA kernels are located in a separate file named kernels.cu.
If, in that scenario, you want to use a different compiler, you can use this method: you build your CUDA code with nvcc and then link it with whatever compiler you like; it can even be GCC or the GNU compilers, and you will still be able to link it. This example covers that.
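A minimal sketch of that separate-compilation flow (file names follow the example above; with the gpu module loaded, the wrapper is expected to pull in the CUDA runtime paths at link time):

```bash
nvcc -c kernels.cu -o kernels.o   # compile the CUDA kernels with nvcc
CC   -c main.cpp   -o main.o      # compile the host code with the wrapper
CC   kernels.o main.o -o app      # link everything together
```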
In the third example, we include MPI in the mix. So in this example, we have MPI and CUDA code
all in the same file. The best way to build that, because nvcc will not recognize MPI here, is to use the compiler wrapper that comes with PrgEnv-nvidia. This is again one of those cases where it will only work if you use this particular module and the compiler wrapper, because then both MPI and CUDA will be linked in.
In the fourth example, we come back to separate compilation. Here we have the MPI library and the CUDA kernels, but the CUDA kernels are located in a separate file, and you can again use any programming environment for this, because you're building the CUDA code separately using nvcc and then linking it using the compiler wrappers, and the compiler wrappers can come from any of the compilers. It's recommended that you use the compiler wrappers.
You could still build the code if you were using, say, g++ directly, but then you would need to link in a bunch of libraries; for example, the CUDA runtime (cudart) library is one of those you'll be needing, and since you don't know where it's located and which paths you need to include, it's always recommended to use the compiler wrappers, because they will take care of everything for you and also make the compilation line look much simpler.
Before we go to the next exercise, this is a slide copied over from Eric's slide deck, and it's basically telling you the compute elements that are available. Look at the rightmost column; it's the CPU on the Perlmutter GPU nodes. As was mentioned before, the Perlmutter CPU nodes have two of those sockets while the GPU nodes have one, so everything is halved: the total physical cores are halved, while the logical CPUs per physical core, which is the hardware-thread count,
stay the same, because they are located within the core. The total logical CPUs per node is also halved, and the number of NUMA domains is also halved, so we have four NUMA domains, while the Perlmutter CPU nodes have eight NUMA domains.
Before we go into GPU affinity, I'll try to get this out of the way: the affinity for the CPU cores is still the same as on the CPU nodes, because nothing has changed here. So it's recommended that you assign the correct number to the -c option, or the --cpus-per-task option
if you write it in the longer format. To compute the correct number, you can use this equation; it's pretty simple: -c = 2 * floor(64 / K), where 64 is the total number of hardware cores that you have on the node and K is the number of tasks per node. So if you had 64 tasks per node, like in this example, the term inside the braces would become one and the answer would be two, and 2 is then the number of hardware threads that you're assigning to each MPI task.
So it's important to get this right, because you want to make sure that you're utilizing the resources well and not pushing too many MPI tasks onto a single core when you have more available, and one way to enforce that is to add the --cpu-bind=cores option. Next, we will look at GPU affinity.
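A worked instance of the formula above, assuming four MPI tasks per node (the executable name is a placeholder):

```bash
# 2 * floor(64 / 4) = 32 logical CPUs per task
srun -n 4 -c 32 --cpu-bind=cores ./app
```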
As described before, on a GPU node we have four NUMA domains, and each NUMA domain contains memory that is faster to access from within that NUMA domain; across different NUMA domains, things become slower. Similarly, just as each domain has its own memory, each domain is assigned a GPU. So this is what it looks like on the GPU node: we have four NUMA domains, and each NUMA domain gets a GPU.
Now, let's say that you had an MPI task residing in NUMA domain 0, and it was assigned, or was trying to communicate with, a GPU that was assigned to NUMA domain 3. It would then have to take a longer path, and things would slow down. So it is important that MPI tasks are assigned GPUs that are closest to them, and we'll see how to do that. In exercise 5, we have two batch scripts.
There is a "regular" batch script and a "closest" batch script. The regular batch script is just the regular way of running things: we don't specify any affinity, and what will happen is that every rank, or every MPI task, will be able to see all the GPUs that are available on the node. In a typical application, when each rank is able to see multiple GPUs, we assign one GPU to each rank in a round-robin fashion. Those of you who have been actively developing or porting
their codes to GPUs will be aware of this. Now, that approach does not really care whether the GPU being assigned to a certain task is the closest to it or whether it lies in another NUMA domain. To make sure that you're getting the closest GPU, you have to specify the flag --gpu-bind=closest, and this is demonstrated in the "closest" batch script example.
Now, when you run the code without the GPU affinity set, you'll see a printout like this. You can see that rank 1 is able to see four GPUs and that it assigns itself a GPU in a round-robin fashion, and we print out the PCI ID of that GPU to differentiate which GPU is being assigned to which rank. But it is not really clear whether it's the closest one, and you can see that every rank can see every GPU.
But if you relaunch the same thing with the "closest" GPU affinity set, you will see that each rank is able to see only one GPU, and that is the GPU that is closest to it. Now, how do we know that it is the closest? When you run this example, some information about the node topology will also be printed out, and we can use that information to verify that we are getting the closest GPU.
Similarly, rank 4, which resides on core 32, has been assigned GPU number 41 as well. Now, let's see where these cores actually reside, cores 32 and 33. From this we can see that cores 32 and 33 reside in NUMA node 2, and if we go to NUMA node 2, we can see that the PCI bus ID of the GPU that has been assigned to that node is also 41.
And then we have CUDA-aware MPI. NVIDIA has this thing known as UVA, or Unified Virtual Addressing. What it does is allow the program to see the CPU and GPU memory in a single virtual address space, and what this makes possible is direct communication between two GPUs. Let's say you have two nodes, for example node 1 and node 2, and you want GPU 1 on node 1 to send a message to GPU 1 on node 2.
Typically, what would happen is that this message would first be sent to the CPU memory on the sending node, then to the CPU of the target node, and then to the target GPU. So that's a longer path. But if you have the CUDA-aware MPI option available, you can send the message directly from one GPU to the other on a remote node, and that is a facility we have available on Perlmutter.
The Cray MPICH build that we have targets this; it utilizes this underlying technology, and that gives you the performance that direct communication is capable of. Now, how to use this: as I showed before, by default a gpu module will be loaded.
This module does a lot of things: it loads other modules and also makes sure that certain environment variables are set up for this kind of CUDA-aware MPI to take place. If you have this module loaded, you don't have to do anything else; you just build your code as usual, and everything will be taken care of.
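One such runtime variable in Cray MPICH is MPICH_GPU_SUPPORT_ENABLED; the gpu module is expected to set it for you, so exporting it manually is only a fallback:

```bash
export MPICH_GPU_SUPPORT_ENABLED=1   # enable GPU-aware communication in Cray MPICH
```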
Sometimes you may run into issues, and the first thing you would want to check is whether your executable was CUDA-aware-MPI capable. You can simply check the libraries that were linked in: run ldd on the executable, or on the library that you're using, and you should see this library linked in.
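For example, a hedged check along these lines (the library to look for is Cray's GPU Transport Layer, e.g. libmpi_gtl_cuda; confirm the exact name against the slide):

```bash
ldd ./app | grep gtl   # expect something like libmpi_gtl_cuda.so in the output
```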
If it is, that means the executable was built for CUDA-aware MPI, and then there may be some other issue, and you can check with us about that.
Example six tries to explain this concept. It is a simple example that shows you how to do this: it sends a message to a GPU located on a remote node and tries to verify that the message was received correctly. Then, in exercise
seven, we explore the OpenACC and OpenMP offload methods. These are two programming models other than CUDA that you can use to target GPUs, and this exercise contains the same kernel that we previously had in CUDA in the other example, just rewritten in OpenACC and OpenMP, so you can compare and contrast the three kernels and see how the three models differ from each other. Here we have tried to keep the OpenACC and OpenMP codes in the same file; we've separated them using ifdef statements.
So you can easily compare and contrast between those two. Now, for OpenMP you can use PrgEnv-cray as well, and both OpenACC and OpenMP can be built using PrgEnv-nvidia; that is the recommended environment if you want to target these two.
But if you have a serious dependence on the Cray programming environment, then you will have to go with OpenMP. Now, in order to build a code for OpenMP offload, you need to pass the -mp=gpu flag to the CC wrapper that's contained in PrgEnv-nvidia, and if you want to build for OpenACC, you pass the -acc flag, as sketched below.
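As a sketch under PrgEnv-nvidia (kernel.cpp is a placeholder source file):

```bash
CC -mp=gpu kernel.cpp -o app_omp   # OpenMP target offload
CC -acc    kernel.cpp -o app_acc   # OpenACC
```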
Also within exercise 7, there is an example that explains how to use OpenACC in Fortran using the ftn wrapper, so you can look at that as well if it is of interest to you. That is all from my end. Thank you very much. I think we have some time left, so you can try to walk through the hands-on exercises, and we will be available here to answer your questions.