From YouTube: 4. ND Range Kernels
Description
Part of "An Introduction to Programming with SYCL on Perlmutter and Beyond", March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
Okay, so this section is about ND-range kernels and, more specifically, the more advanced features like using local memory. There's a bit of duplication with the previous section, so I'll skip past that.
We want to learn about the SYCL execution and memory model, how to enqueue an ND-range kernel, the ND-range kernel functions, and how to use local memory. So again, the fundamental unit is the work-item; this is taken from a slide in the previous section. Work-items are organized into work-groups, and work-groups are organized into ND-ranges.
So this is a 16-by-two ND-range, and the local range here is, I suppose, four by one.
Now take a particular work-item in this ND-range. There's a global range, which is (12, 12), and each work-item has a global ID. This work-item has global ID (6, 5): counting zero, one, two, three, four, five, six in one dimension, and zero, one, two, three, four, five in the other. The group range is the number of groups, or the number of blocks if you're used to dealing with CUDA; here it's (3, 3). For the group ID, sorry, I should be counting to the right and then down: so it's (2, 1). We'll talk more about this on a later slide, but the way you index into these ranges matters, and it's actually the opposite of what it is in CUDA, which is important to know.
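As a rough sketch of what that means, assuming the row-major linearization defined in the SYCL spec: in SYCL the right-most index varies fastest, whereas in CUDA the x component is listed first and is the fastest-varying one.

```cpp
// SYCL: for an item with id (i0, i1) in a range (r0, r1),
// the right-most dimension varies fastest:
//   size_t linear = i1 + i0 * r1;
// CUDA goes the other way: threadIdx.x (the first component) varies fastest:
//   unsigned linear = threadIdx.x + threadIdx.y * blockDim.x;
```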
So, the SYCL execution model: typically an ND-range invocation in SYCL will execute the SYCL kernel function on a very large number of work-items, often in the thousands. This allows us to achieve good occupancy, that is, good use of the compute units on a GPU or some other offload device.
Sorry, this is a bit of duplication from the previous section: we can synchronize across a work-group using a barrier. This is especially important if we're dealing with local memory, which we'll detail in a bit. So again, private memory: each work-item has memory that is completely private to it, like a register. We also have local memory, which is shared by a work-group.
So this work-item can write to local memory, which can then be read by another work-item. If that is to happen, we need to use a barrier to make sure the write has indeed finished before the read happens. And, very importantly, a work-item in one work-group cannot access the local memory of another work-group; it can only access its own work-group's local memory. And then, again, there is global and constant memory.
So private memory is very, very fast. Local memory is pretty fast: theoretically somewhere between a few cycles and maybe 10 cycles to access, whereas a private-memory access is theoretically maybe one or two cycles, and global and constant memory are in the hundreds of cycles, so a lot slower. So if we can use local memory for, say, intermediate computation, we really should, whereas global memory is of course still necessary for the initial reads and the final writes of values. Global memory is larger than local memory, and local memory is larger than private memory. So this is the usual speed-versus-size hierarchy: private memory is small and very fast, local memory is relatively small and relatively fast, and global memory is large and slow.
This time we're interested in constructing a parallel_for with an nd_range. We're only going to deal with one-dimensional ND-ranges in this workshop. An nd_range is made up of the global range, which in this case is 1024, and the local range, the size of your work-group. The local range must divide the global range.
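As a minimal sketch of what that looks like (the names global_size and local_size are placeholders, not taken from the slides):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t global_size = 1024;  // total number of work-items
  constexpr size_t local_size  = 128;   // work-group size; must divide global_size

  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{global_size}, sycl::range<1>{local_size}},
        [=](sycl::nd_item<1> item) {
          size_t gid = item.get_global_linear_id();  // this work-item's global index
          (void)gid;  // the kernel body would use gid here
        });
  });
  q.wait();
}
```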
Then we can use our nd_item to get things like the global linear ID. We can also get things like the group ID, we can get the local ID, and we can do a barrier; we can do a lot of things with our nd_item.
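For reference, a sketch of those nd_item queries inside an ND-range kernel body (the fence-space argument shown is one common choice, not prescribed by the slides):

```cpp
// Inside a parallel_for taking sycl::nd_item<1> item:
size_t global_id = item.get_global_linear_id();  // index across the whole ND-range
size_t local_id  = item.get_local_linear_id();   // index within this work-group
size_t group_id  = item.get_group_linear_id();   // index of the work-group itself
item.barrier(sycl::access::fence_space::local_space);  // synchronize the work-group
```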
Okay, so on most hardware, global_range[i] must be a multiple of local_range[i]. So this first one is not okay: we won't be able to construct an nd_range with it, because 64 does not divide 1000, so this will give us an error. This one is okay; it divides.
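A small sketch of the divisibility rule with the sizes mentioned here:

```cpp
// 64 does not divide 1000, so this nd_range is invalid and will raise an error:
// sycl::nd_range<1> bad{sycl::range<1>{1000}, sycl::range<1>{64}};

// 64 divides 1024, so this one is fine:
sycl::nd_range<1> good{sycl::range<1>{1024}, sycl::range<1>{64}};
```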
Okay
yeah.
So
for
nvidia
hardware,
work,
group,
sizes
or
block
sizes
are
best
chosen
from
well,
usually
not
8
or
16,
but
usually
32,
64,
128,
256,
512
1024.
Okay, so using local memory. This is a really important thing if you want to write performant code. In SYCL it's called local memory; sorry, SYCL local memory is called shared memory in CUDA, as we've mentioned already. To use local memory, you must use an accessor. We've glossed over the buffer/accessor model, but for local memory, and also for things like constant memory and texture memory, you need to use accessors. The way we say that it's local memory is that we construct an accessor with sycl::access::target::local.
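A sketch of constructing such an accessor inside a command group; local_mem_size is a placeholder element count, and float stands in for the element type T on the slide. The exact spelling depends on the SYCL version: the talk describes the older access::target::local form, and SYCL 2020 provides sycl::local_accessor as the equivalent.

```cpp
q.submit([&](sycl::handler &cgh) {
  constexpr size_t local_mem_size = 128;  // placeholder: local elements per work-group

  // Older (SYCL 1.2.1-style) spelling, matching the access::target::local form described here:
  sycl::accessor<float, 1, sycl::access::mode::read_write, sycl::access::target::local>
      local_mem{sycl::range<1>{local_mem_size}, cgh};

  // SYCL 2020 equivalent:
  // sycl::local_accessor<float, 1> local_mem{sycl::range<1>{local_mem_size}, cgh};

  cgh.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{local_mem_size}},
      [=](sycl::nd_item<1> item) {
        // Use the accessor like a pointer / array from within the kernel.
        local_mem[item.get_local_linear_id()] = 0.0f;
      });
});
```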
We didn't really have to worry about this q.submit beforehand; we were just dealing with q.single_task and q.parallel_for. But when you need to dictate the way memory is managed within your kernel, the memory that needs to be used within the kernel, then essentially you need to define this memory and how it relates to outside memory, or not. In this case it is purely contained within the q.submit.
This local memory can't really go anywhere, but you need to do this within a submit function. You can't just do it within, say, a parallel_for, because the memory needs to be set up beforehand. This is similar to CUDA: if you have a dynamic shared memory size, you need to configure that when you're launching the actual kernel. So you're constructing an accessor which essentially just says "give me a chunk of memory", and you can access the accessor like a pointer, just like a normal pointer. Well, that's because this one is one-dimensional; if it were two-dimensional you'd need to do a bit more, but one-dimensional local memory is probably the way to go. So this is local memory of size local_mem_size, and it has type T, and then it can be used within my kernel.
So this is something to keep in mind: once you do a q.submit, every subsequent operation needs to go through the handler for that command group. So this is the command group: it's a command plus some memory operations.
And this q.submit can only contain one command, a command being a parallel_for, a single_task, a memcpy and so on. You can only do one of these within a q.submit. So we use a submit when we need to do stuff with memory; but if we don't need to do anything with memory, with accessors for instance, then it's easier to use q.memcpy, q.parallel_for and so on, just because it's less bookkeeping. But yeah: one command within each submit.
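A sketch contrasting the two forms; n, local_size, in, out, dst and src are placeholders (USM pointers and sizes), and the shortcut member functions are the ones SYCL 2020 defines directly on sycl::queue:

```cpp
// Shortcut form: fine when no accessors or local memory are needed.
q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { out[i] = in[i]; });
q.memcpy(dst, src, n * sizeof(float));

// Explicit submit: needed when the command group has to set up memory
// (for example a local accessor) before launching its single command.
q.submit([&](sycl::handler &cgh) {
  sycl::local_accessor<float, 1> scratch{sycl::range<1>{local_size}, cgh};
  cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{local_size}},
                   [=](sycl::nd_item<1> item) {
                     scratch[item.get_local_linear_id()] = in[item.get_global_linear_id()];
                   });
});
```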
Okay, so again, we're saying the access target is local, and then you can treat accessors like pointers from within kernels, so you use this operator[]. So local_mem is declared here with a particular size, and cgh, the command group handler, needs to be passed in as well. Then once we're inside our kernel, you can just use it as a normal pointer; that's fine.
Okay, so now for an ND-range: when you submit a parallel_for, you pass in an nd_item. You don't necessarily need to take an nd_item, but it's useful. If you do, the nd_item has some really useful member functions, like get_global_linear_id(), which gets your global linear ID as if you were indexing into a one-dimensional array, and get_local_linear_id().
The local linear ID is just your linear ID within your work-group, and there's also the group ID; there are lots of different queries. Let's also imagine that we're writing some value to local memory at our local ID, and then we want to use this value at some other point: then we're going to use item.barrier(). This is a member function of the nd_item, and it's really important if you're using shared memory. Usually, every time you write to local memory and want it to be read somewhere else, you need to make sure you do a barrier. A barrier is also a member function of a group, so you can do it that way as well.
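The general pattern being described, as a sketch (scratch is a hypothetical local accessor, lid the local linear ID, wg_size the work-group size, some_value a placeholder):

```cpp
scratch[lid] = some_value;                              // write my slot of local memory
item.barrier(sycl::access::fence_space::local_space);   // wait for the whole work-group
float neighbour = scratch[(lid + 1) % wg_size];         // now safe to read another slot
```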
Okay: what is the motivation for organizing memory in blocks of local memory? Essentially, we're utilizing the hardware. Each block, each work-group, has access to this region of memory that's quite fast, and it can be shared between work-items. So if there are computations that can make use of shared memory, we can share data; this is the only way we can share data among work-items.
We cannot share data among work-items in separate work-groups, so it's only through this shared memory that we can have writes followed by reads from different work-items. We're going to see a simple example of how we use this to try and optimize things in the final exercise. But essentially, the hardware is there, so we want to use it. It's fast, and if you can use local memory, your algorithm, whatever it
is, can be orders of magnitude faster than if you're naively using global memory. If it's suitable, that is: not every task requires shared memory, but if yours does, then you should use it.
B: But just for clarification, what am I going to do if I have a distributed-memory machine with some GPUs? How do I actually handle the memory? Do I have to use MPI subroutines to keep transferring the data back and forth, with SYCL used only for accessing the GPUs, or how do I handle that situation?
B: Yes, so if I have a distributed machine which has, on its local nodes, let's say some accelerators, GPUs or, I don't know, FPGAs or something, how do I handle the memory and the communication? Do I need to use MPI and then just transfer the data back and forth between the host memory and the device?
E: Yes, that's a really good question. SYCL is sort of a single-node programming model, so you would generally use something like MPI to communicate between nodes and run an instance of SYCL, with DPC++, on each node.
B: So then, to compile those, I write my code. Essentially, I can write it in SYCL and just link, and also load, I guess, the MPI library, right?
D: Yeah, I was just going to say that at this exact moment we don't have a module for MPI built against the SYCL compiler.
But if you want to try that out, I have a proof-of-concept, though largely untested, build that should be able to properly use the high-speed network on Perlmutter. I'm happy to have a conversation over Slack, or however, if you want to take a look at that.
One other comment: I'm aware that there is a research project called Celerity, which is, I guess, billed as something like a SYCL for distributed-memory platforms. That's about the extent of my knowledge of it, other than that it's a research project and it's outside the scope of what we're targeting at the moment.
A: Why is it limited to 3D? This is a good question; it's just got to do with the SYCL spec. I'm not entirely sure why it was limited to three dimensions.
C: I had a thought: in the interest of time, and getting through the last section, maybe it would make sense to move on.
A: The task is to flip an array: you have an array and you reverse its direction. You could do this naively; I'm not going to write all of this out, but you could naively queue a parallel_for.
Okay, the naive way of doing it: we're just reversing the vector. The output is the pointer b, and then we would do something like, well, first we need to get some indices. Oh, this is much nicer, okay. The idea is that you just flip the array: we write the input element, from the pointer a at index global_idx, to the flipped position in the output b. That's the naive way of doing things.
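A minimal sketch of that naive version, assuming USM device pointers a (input) and b (output) of length N, following the live-coded example only loosely:

```cpp
// Naive reversal: each work-item copies one element to its mirrored position.
// One side of the copy walks global memory in reverse order.
q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> idx) {
  size_t i = idx[0];
  b[N - 1 - i] = a[i];
}).wait();
```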
A
But
if
you
use
shared
memory,
so
essentially
you'll
be
accessing
this
in
a
reverse
way.
Okay,
so
each
work
item
will
be
accessing
instead
of
accessing,
like
in
a
where
work
item,
say,
n
or
work
item.
I
is
accessing
point
I
and
work
item
I
plus
one
is
accessing
the
they're
indexing
into
I
plus
one.
It's
going
the
opposite
way,
so
this
is
actually
probably
fine
or
is
it
no?
It
definitely
is
fine
by
modern
kind.
CUDA hardware; the memory controllers can handle indexing into things in a backwards, flipped way. But it depends on the device: on an older device it wouldn't necessarily be as optimal, so it would be beneficial to use shared memory as an in-between. So we'll just look at the solution to see.
In solution.cpp, okay: instead of just doing it the naive way, we're allocating some local memory.
We then load the local memory with the global value, then we execute a barrier, and then we write back to global memory in a completely aligned way, so we're not flipping that access, while reading back from local memory. We'll talk about this more in the upcoming section, but essentially, if you can access global memory in as uniform and contiguous a way as possible, that will give you much better performance.
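A sketch of that pattern (load local memory, barrier, write back contiguously). Here in and out are assumed USM pointers, N is assumed to be a multiple of wg_size, and the details may differ from the workshop's actual solution.cpp:

```cpp
q.submit([&](sycl::handler &cgh) {
  sycl::local_accessor<float, 1> scratch{sycl::range<1>{wg_size}, cgh};

  cgh.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{wg_size}},
      [=](sycl::nd_item<1> item) {
        size_t gid        = item.get_global_linear_id();
        size_t lid        = item.get_local_linear_id();
        size_t group      = item.get_group_linear_id();
        size_t num_groups = item.get_group_range(0);

        // Contiguous read from global memory into this work-group's local memory.
        scratch[lid] = in[gid];

        // Ensure every work-item's write to local memory has finished
        // before any work-item reads another slot.
        item.barrier(sycl::access::fence_space::local_space);

        // Contiguous write back to global memory: this group's block lands in
        // the mirrored block of the output, and the element order is flipped
        // by reading local memory backwards.
        size_t out_base = (num_groups - 1 - group) * wg_size;
        out[out_base + lid] = scratch[wg_size - 1 - lid];
      });
}).wait();
```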
We'll talk about this more, and I haven't really explained exactly what I mean yet, but if we can get work-items i and i+1 accessing adjacent points in memory, that will give us better performance. In this case, in the naive version, they actually are accessing adjacent points in memory, but flipped.
So work-items i and i+1 were accessing elements j and j minus one, instead of j and j plus one. On older hardware that would have been a problem; on modern hardware it's not a problem. But it's a very simple use of local memory.
So, item.barrier: this ensures that all of the work-items in a particular work-group wait until every work-item in that work-group has reached this point, and only then do they proceed to the next step. In this instance, and this is a really, really good question, we're writing to local memory.
Okay, and then we want to read from local memory. Without this item.barrier there would be a problem, because this work-item is writing to local_idx but reading from workgroup_size minus local_idx minus one: it's reading from a completely different slot, one that's been written to by another work-item. We need to be able to guarantee that the other work-item has finished writing to that slot in local memory, and the only way we can do that is by using a barrier.
So this just synchronizes all of the work-items in a work-group, but only within that work-group, not across the larger device. We have no way of doing that across the whole device, except by finishing one kernel and then launching a new kernel, and there is some overhead in that.
There is also some overhead if work-items have diverged slightly, and so on. On CUDA hardware, work-groups are organized in warps, which are groups of 32, and warps might be executing at slightly different speeds. So it might just happen that one warp is slightly ahead, running ahead by a few cycles or whatever, and then it needs to wait for another one to catch up. But the performance gain of using shared memory is really worth this barrier.
It's worth having to wait for, say, that one warp. And it also applies within warps: because of independent forward progress, sometimes within work-groups there can be divergent control flow, so you do need to call the barrier within warps as well. But yeah, the overhead is worth it, essentially, because you're using very, very fast memory.
So, by local memory, if you mean CUDA shared memory, which requires this synchronization, then you can't share that among work-groups on the same device, let alone across different devices.
A
So
not
necessarily
from
writing
to
the
same
memory.
It's
just
you
have
two
things
happening,
asynchronously
so
essentially
or
not
asynchronously,
but
concurrently,
so
one
is
going
to
write
and
then
essentially
you
want
this
one
to
read
as
soon
as
it
has
been
written.
This
is
a
way
of
enforcing
that,
but
there's
this
event
that
every
work
item
needs
to
get
to
before
it's
allowed
to
read
it's
not
it's
not
necessarily
it's
not
about
processes
and
we're
not
thinking
about
the
operating
system.
A
Really,
here,
that's
more
got
to
do
with
yeah
we're,
not
thinking
of
of
operating
systems.
Here
we're
just
thinking
about
device
code
essentially.
Well,
I
I'm
not
sure
how
operating
systems
interact
with
you
know
offload
devices
in
the
first
place,
but
no
all
of
this
is
allocated
within
the
program
anyway
and
within
an
individual
program.