From YouTube: 10. CUDA C++ Basics
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
So in this last session, what I would like to do is say a little bit about CUDA. We decided to do CUDA last because we wanted to demonstrate that there are plenty of ways to use the GPU that don't require you to drop down to a lower level. The main distinction between CUDA and the programming models we've discussed so far is that in CUDA you're expected to put in a little more effort as the programmer, mapping parallel threads to items of work. Think about the models we've discussed so far: parallelism in the standard languages, as well as OpenMP and OpenACC.
These models didn't require you to explicitly map parallelism to threads. What you were doing was telling the compiler what work you wanted to do and where you wanted it to happen, and the compiler figured out how to map that to parallel threads on the device. As Brent just discussed, you can take a little more control over this process as an optimization by telling OpenMP or OpenACC how many threads to use per gang or per team.
You can also tell it how many teams or gangs to use, but you aren't required to, and in any case you still aren't required to figure out what those teams and threads are doing; you're just telling the compiler how many to use as a tuning exercise.
That being said, CUDA does require a little more effort than the models we've discussed so far. I'm only going to talk about CUDA as it applies to C and C++ in this presentation, and I do call it CUDA C++ because technically that's what it is; technically there is no such thing as CUDA C. But for the most part you can use standard C in CUDA, as long as you stay within the subset of C that doesn't conflict with C++.
There are other languages that have bindings to CUDA. The most popular ones today are Python and Fortran. Both of these give you a way that is either Pythonic or Fortran-like to write CUDA, and both are models supported by NVIDIA. I'm not going to talk about them today, but if you do see CUDA Fortran or CUDA Python out in the wild, you'll be able to recognize it based on what you know of CUDA C++.
Now, the fundamental difference between the way CUDA exposes parallelism on the GPU and the previous models we've talked about is that CUDA explicitly requires you to identify kernels. Remember, the kernel is the fundamental unit of work on the GPU: the discrete bit of work which you launch on the GPU, which runs for some time and then completes, and in any given program you'll probably have many kernels that you run. In the context of a model like OpenMP or OpenACC, the compiler identified those kernels for you.
A function that can run on the GPU in CUDA has to be marked up with the keyword __global__, with two underscores on either side, as I've indicated here. This is the only way in CUDA to launch a function on the GPU, or the device as I'm calling it here; that's just jargon. It runs on the device and is launched from the host, that is, from the CPU. And that, by the way, is how these other models work under the hood at the end of the day, regardless of whether you're using OpenMP or OpenACC.
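As a minimal sketch of what that markup looks like (the kernel name and body here are placeholders, not taken from the slides):

    __global__ void mykernel(void) {
        // body of the kernel: this code runs on the GPU (the device)
    }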
So __global__ is a CUDA-specific keyword; it is an extension. It is not a standard C++ keyword, but it is a keyword you can apply in front of a function definition in order to make it a kernel that can run on the GPU. CUDA, because it is an extension to C++, is not going to be understood by every compiler.
NVIDIA provides nvcc, the CUDA compiler, which can parse this. It looks for any functions, or kernels, defined in the source code that have this __global__ attribute.
Okay, sorry about that. So, right: device functions, kernels that have this __global__ keyword, are processed by the device, or GPU, part of the compiler toolchain, and everything else that is standard C or C++ is compiled by your standard host compiler. So on Linux, it would be g++ that compiles that part of the code.
But these triple angle brackets, or triple chevrons, with a pair of numbers separated by a comma between them, are CUDA-specific syntax, which means that this function call is actually launched from the CPU onto the GPU. So any place you see these triple chevrons with a pair of numbers in between, it means: I want to launch a CUDA kernel. I want to launch a kernel on the GPU; that is, I want to actually start executing GPU work.
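For example, a minimal launch of the placeholder kernel above might look like this (one block and one thread, just to show the syntax):

    int main(void) {
        mykernel<<<1, 1>>>();     // execution configuration: 1 block, 1 thread per block
        cudaDeviceSynchronize();  // wait for the GPU work to finish
        return 0;
    }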
That means whatever the contents of my kernel are will run on the GPU, and there will be a single thread that executes it. Based on the GPU architecture discussion that Brent led yesterday, hopefully you understand already that if you are ever in a situation where you're running a kernel that launches only one thread, you're very likely using the GPU ineffectively from a performance perspective.
So how can we write a CUDA kernel that adds one vector, or array, to another? Here is what our add function will look like: it has this __global__ keyword; otherwise it's a standard C function definition. I'll describe what the new syntax means in a second, but fundamentally it will look like writing b at some index equals b at the same index plus a at the same index. So this is just adding the array a to the array b.
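A sketch of that kernel, using one block per element as described below (the names a and b follow the talk; the rest is my assumption about the slide):

    __global__ void add(int *a, int *b) {
        // each block handles one element, selected by its block index
        b[blockIdx.x] = b[blockIdx.x] + a[blockIdx.x];
    }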
So instead of having an explicit for loop, we have, in some sense, an implicit for loop, where the body of the function is run many times, and the index of any particular block can be read in the body of this function using this blockIdx.x variable. This is defined for you by the compiler and the runtime; it's part of CUDA.
A
It's
not
a
standard
c
variable.
But
if
you
do
b
of
block
index.x
equals
b
of
block
and
x
dot
x,
plus
a
of
block
index
set
x,
then
that
will
do
the
corresponding
update
of
a
given
element
of
these
of
this
array
b
at
the
index
corresponding
to
block
index.x,
which
will
be
unique
for
every
block.
That
is
running
the
body
of
this
function
and
it
will
be
zero
indexed,
at
least
in
c.
So there will be block indices from zero to N minus one, where N is the number you gave in the execution configuration. Whatever number N I choose in the execution configuration, there will be that many blocks running the body of this function, and if the length of the arrays, or vectors, is N, then this gives us exactly the number of items of work that I need to perform.
So here's what a main would look like if I am calling this function, or launching this kernel, on the GPU. I would define some arrays a and b, and then I would have to allocate memory for them. The way we're going to choose to allocate memory in this example is with the memory allocator cudaMallocManaged. cudaMallocManaged means:
A
I
want
to
allocate
the
array
a
and
b
to
have
memory
that
can
be
accessed
on
either
the
cpu
or
the
gpu.
We
talked
about
this
a
little
bit
yesterday.
This
is
the
same
strategy
that
we
use
for
standard
language
parallelism
and
can
often
use
for
openmp
and
openhc
as
well.
If
we
want
to
where
we
allocate
memory
that
is
accessible
on
either
the
cpu
or
the
gpu
and
wherever
it's
accessed
it
will
automatically
migrate
to
in
order
to
be
used
there.
A
It
is
a
little
bit
different
syntax
from
malek,
because
you
give
it
the
address
of
the
pointer
rather
than
setting
the
pointer
to
the
return
value
of
the
malloc
call,
but
it
fundamentally
does
the
same
thing
where
it
allocates
a
size
in
bytes
and
then
updates
the
pointer
to
that
location
to
that
allocation.
Once we've allocated memory, that memory can be accessed on either the CPU or the GPU. We're going to fill the arrays a and b with some data, perhaps random data, and then we're going to launch our kernel on the GPU with N blocks.
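Putting that together, a minimal sketch of the host code might look like the following (the value of N, the initialization loop, and the omission of error checking are my assumptions; the kernel is the one sketched above):

    #define N 512

    int main(void) {
        int *a, *b;
        size_t size = N * sizeof(int);

        // allocate memory reachable from both the CPU and the GPU
        cudaMallocManaged(&a, size);
        cudaMallocManaged(&b, size);

        // fill a and b with some data
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        // launch the add kernel with N blocks of 1 thread each
        add<<<N, 1>>>(a, b);
        cudaDeviceSynchronize();  // wait for the GPU to finish

        cudaFree(a);
        cudaFree(b);
        return 0;
    }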
Now, a block can also be split into parallel threads. We could update the add function to use parallel threads instead of parallel blocks, and the body of the function would look very similar, except now the index we use to determine which element of the array b we're going to update is based on threadIdx.x.
If I want to use threads within a block as the level of parallelism, then I would write 1 comma N instead of N comma 1 in the execution configuration that launches my parallel add function.
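As a sketch, the thread-based variant (same assumptions as the earlier sketches) would look like this:

    __global__ void add(int *a, int *b) {
        // each thread within the single block handles one element
        b[threadIdx.x] = b[threadIdx.x] + a[threadIdx.x];
    }

    // launched with 1 block of N threads:
    // add<<<1, N>>>(a, b);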
Now I'm going to talk about combining both blocks and threads within a block, a two-level, hierarchical parallelism, for doing this vector addition. I will explain in a second why we might do this. Why would we make it more complicated this way when either of the previous approaches I've discussed is valid? I'll come back to that. First, I want to say something about indexing of data.
Every block has a unique index from zero to N minus one, and every thread within a block has a unique index from zero to M minus one, where M is the number of threads per block; that thread index within a block is then replicated across all blocks. So if we look at block index 0, it's going to have threads 0 through 7; block index 1 will also have threads 0 through 7, and the same for block index 2 and block index 3.
So if we take an offset corresponding to the block we are in, which reflects how many threads there were in the grid prior to this block, and combine that with an offset within the block, that gives us a unique index within the grid. Again, you get threadIdx.x plus blockIdx.x times M, where M is the number of threads per block.
So with that in mind, let's highlight an element in the grid in red and ask the question: which thread in the grid would operate on that element of the array?
We advance through by threads until we reach the target element. That corresponds to blockIdx.x equal to 2, because that's the block where this red element falls, and within this group of threads it's the thread with index 5 that happens to correspond to array index 21.
If we then do the math, we can verify that if we set threadIdx.x equal to 5, take blockIdx.x equal to 2, multiply the block index by the number of threads per block, and add the two, that should equal the desired target slot in the array, 21, and in fact it does. If you do the math here, you'll see that it equals 21 as desired.
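Spelling out that arithmetic, with M = 8 threads per block as in the figure:

    index = threadIdx.x + blockIdx.x * M = 5 + 2 * 8 = 21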
Okay, so we've now written down an algorithm for how we identify which parallel threads map to which elements of the work, or of the array, and we said the last piece of information we need, other than the thread index and the block index, is how many threads there are per block. CUDA also provides a built-in runtime variable, blockDim.x, which gives the number of threads per block.
With this idea, I can update the definition of my add function to combine both parallel threads and parallel blocks. I'll define a unique index within the grid, index = threadIdx.x + blockIdx.x * blockDim.x, and then I will say b[index] = b[index] + a[index]. So it's the exact same code we used before; we're just defining the index a little differently, combining both parallel threads and parallel blocks to define the index into the array that we will use.
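A sketch of that combined version (same caveats as the earlier sketches):

    __global__ void add(int *a, int *b) {
        // unique index within the whole grid
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        b[index] = b[index] + a[index];
    }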
The number of threads per block is closely related to the number of threads per team or per gang that were launched in the directive-based models, and there are some values that are better than others; for example, 128 is a pretty good value. But it's a value that ultimately you control and have to define for yourself; the compiler does not help you.
Of course, you'd want to make sure you do a ceiling division in case N does not divide evenly by the threads per block; the example in the repo that you'll do later for the hands-on covers that. But fundamentally that's what we do.
Now notice how this is different from the previous models we showed. You are now telling the GPU exactly how many blocks and threads you want it to run, and you are required to do the work of mapping those blocks and threads to the work that you want done. CUDA doesn't have any guardrails in this sense; you are in control of exactly what the GPU does, which gives you a lot more power, but also a lot more responsibility to get it right.
So generally, when you write safe code, you'll want to put in an if condition which says: only if index is less than n, the number of elements in the array, do I make any modifications to the array. That often requires you to pass the length of the array as an argument to the kernel, to make sure you don't run off the end of it, because in C
we don't have that length information from a bare pointer unless we provide some additional context to the function. And so this is the updated ceiling division, or rounding-up division, or at least one way to do it, which guarantees you will always have enough threads to do the work: you compute (N + M - 1) / M, where M is the number of threads per block. That guarantees that you divide by M but round upward, so you'll always have at least as many blocks as needed to do all of the work.
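Putting the bounds check and the rounded-up launch together, a sketch might look like this (THREADS_PER_BLOCK is a name I'm assuming; pick whatever value you tune to):

    #define THREADS_PER_BLOCK 128

    __global__ void add(int *a, int *b, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)                      // guard against running off the end of the arrays
            b[index] = b[index] + a[index];
    }

    // launch with enough blocks to cover all n elements, rounding up:
    // add<<<(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(a, b, n);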
And then you will be good. There will potentially be more threads launched than actually needed.
But that's okay! There probably won't be that many if n is sufficiently large; one block at the end of the grid will have a few idle threads, which is not the end of the world. So this is, in general, a safe and generic way to write your kernel launches, and the way that professional CUDA code tends to be written.
So why do we have both blocks and threads? I can answer that in two ways. One of them is to say that threads within a block have direct mechanisms to communicate and synchronize with each other, in a way that blocks don't have as easily, and that gives you a way to make interesting new kinds of algorithmic choices that rely on explicit communication or synchronization between threads, which are not available to you if you only use blocks. Of course, you could just flip the question around: well then, why do we have blocks at all?
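As a rough illustration of the kind of mechanism meant here (a sketch of mine, not from the slides, assuming the kernel is launched with 128 threads per block): threads in the same block can share data through __shared__ memory and synchronize with __syncthreads(), for example in a block-local sum.

    __global__ void block_sum(int *in, int *out) {
        __shared__ int tile[128];                 // memory visible to all threads in this block
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();                          // wait until every thread in the block has stored its value
        if (threadIdx.x == 0) {                   // one thread then sums the block's values
            int sum = 0;
            for (int t = 0; t < blockDim.x; t++) sum += tile[t];
            out[blockIdx.x] = sum;
        }
    }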
That being said, it doesn't require that much extra work, right? We've seen that in many cases you can pretty much ignore the hierarchy, write the kernel as if there's a single dimension of parallelism, just get a single unique index in the grid, and use that to do some work. But the fact that there is a two-level hierarchy is relevant both for performance and for doing certain kinds of algorithms, and it is ultimately the same reason why OpenACC has gangs and also vector lanes within a gang.
It reflects the same design pattern in how modern parallel processors work, and so it is relevant from a performance perspective. I'm going to stop there, because you could go on about CUDA for a long time, and I really just wanted to give you a flavor of what goes on in CUDA. I will leave some links here to introductions to CUDA if you want to learn more, and in particular there's one resource that I strongly recommend.
If you want to learn more: we did an extensive training series, in conjunction with Oak Ridge and later NERSC and Argonne, on introduction to CUDA, and it's probably the best resource on the internet for learning CUDA. I'm not tooting my own horn there, because I'm not the one who gave the presentations, but my opinion is that it's one of the best sets of public presentations there are on CUDA, a great resource for you to learn from, and in fact the slides that I'm using today are adapted from that training series.
Thanks a lot, Max. Yes, I concur that the CUDA training series is very comprehensive, so it's good if we have a chance to review it, and there are hands-on exercises there as well. But Max's 30-minute training with the hands-on today will also be very helpful.
Should I determine the number of blocks based on something I know about the hardware? The answer is yes, you should. Generally speaking, you want to think about how many multiprocessors there are on the GPU and have at least as many blocks as there are multiprocessors, because ultimately every block maps to one of those multiprocessors. Brent also described yesterday the fact that the multiprocessors can themselves do parallel processing and be working on multiple blocks at a time, so actually you'll do even better than that.
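If you want to base that choice on the hardware programmatically, one way (a sketch using the standard CUDA runtime query, placed anywhere in your host code) is:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // properties of device 0
    int num_sms = prop.multiProcessorCount;  // number of multiprocessors on the GPU
    // aim for at least num_sms blocks, and typically several blocks per multiprocessor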
There is also a much, much higher limit on how many blocks can be queued up at one time, which most people will never hit in practice because it's basically a billion. So it is worth knowing something about the hardware when tuning that choice; however, you're not required to, and you can choose any number up to the maximum limit that CUDA allows and get correct execution, as long as you write semantically correct CUDA code.