From YouTube: 3. Data Parallelism
Description
Part of An Introduction to Programming with SYCL on Perlmutter and Beyond, March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
A
Okay, so data parallelism. This is what it's all about, really: obviously we want to use offloading devices because they allow us to do work in parallel, so this is important.
A
So in this section we're going to learn about task parallelism and data parallelism, learn about the SPMD model for describing data parallelism (this is the single program, multiple data model, the Flynn's taxonomy kind of thing), learn about the SYCL execution and memory models, and learn about enqueuing kernel functions with parallel_for. Okay.
So task parallelism is where you have several, possibly distinct, tasks executing in parallel; in task parallelism you optimize for latency, we want low latency.
A
Data parallelism is where you have the same task being performed on multiple elements of data; in data parallelism you optimize for throughput. We're mostly dealing with data parallelism here. Many processors are vector processors, which means that they can naturally perform data parallelism: GPUs are designed to be parallel, and CPUs have SIMD instructions, which perform the same instruction on a number of elements of data.
A
So you might be doing some sort of a loop, and then in parallel SPMD code we're just defining this for a single iteration in that iteration space, and you're doing a parallel_for over it. Okay.
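As a rough sketch of that idea (the queue q and the names a, b, c, n are illustrative, not from the slides), the serial loop and its SPMD equivalent might look like this:

    // Serial loop: the loop body runs once per iteration, in order.
    for (std::size_t i = 0; i < n; ++i) {
      c[i] = a[i] + b[i];
    }

    // SPMD version: describe only a single iteration; the runtime launches
    // one work-item per point in the iteration space.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      c[i] = a[i] + b[i];
    }).wait();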
So in SYCL, kernel functions are executed by individual work-items; these are the smallest unit of work.
A
The size of the work-groups is generally related to what works best on the device being targeted. You don't need to specify a work-group size; you can specify one manually if you want, but if not, there are heuristics to choose a good work-group size.
A
It can also be affected by the resources used by each work-item. Okay, so SYCL kernel functions are invoked within an ND-range.
So essentially, work-items are grouped into a work-group, and then the next step up in the hierarchy is the ND-range. In CUDA this would correspond to threads, blocks and then the grid.
A
So an ND-range is composed of the dimensions of the work-group as well as the global dimensions: the global size, the global range.
Okay, so this is an instance of an ND-range. This is the global range, 12 by 12. We're not saying how many work-groups we want; we're saying how many work-items in total we want in the global space.
A
So we have one, two, three, and so on up to twelve, and then the same in the other direction. So the number of work-groups is inferred; it's not handed to the constructor for an nd_range. It could be one, two or three dimensions, and that's important: each of these ranges could be one, two or three dimensions.
A
It has two components: the global range (this bit) and the local range, which is essentially the size of the work-group.
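A minimal sketch of constructing the 12-by-12 example above; the 4-by-4 local range here is only an assumption for illustration, the slide's actual work-group size may differ:

    sycl::range<2> global{12, 12};  // total work-items in each dimension
    sycl::range<2> local{4, 4};     // work-items per work-group
    sycl::nd_range<2> ndr{global, local};
    // The number of work-groups (here 3 by 3) is inferred as global / local.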
A
So multiple work-items will generally execute concurrently on offload devices. It's useful to imagine that these are all executing in complete lockstep; there are exceptions to this, but it's a good thing to have in your head when we're writing code for offloading devices.
A
The order that work-items and work-groups are executed in is implementation-defined. This is important: you could have a large ND-range where it's nice to think of the work-items as all executing in parallel, but in fact you might have these work-groups executing, and then those work-groups, and then more work-groups, and so on.
A
So you need to be careful about any writes to, say, global memory: we can't make assumptions that things are happening at exactly the same time.
A
Okay, so work-items in a work-group can be synchronized using a work-group barrier. This is really important. If you have some work-items in a work-group executing, they might fall out of lockstep, and, especially with larger work-groups, they might not all be executing concurrently at the same time, so imposing a barrier means that all of the work-items need to arrive at the same point before they can get to the next step.
A
This is done with the barrier function, which is a member function of the item (or of a work-group).
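A minimal sketch of the pattern, assuming an nd_range launch (the sizes and the kernel body are illustrative only):

    q.parallel_for(sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{64}},
                   [=](sycl::nd_item<1> it) {
      // phase 1: each work-item writes something other items will read...
      it.barrier();  // every work-item in the work-group must reach this point
      // phase 2: safe to consume what other work-items in this group produced...
    });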
SYCL does not support synchronizing across all work-items in the ND-range.
Okay, this is also important, so we can't have a global sync of all work-items in our ND-range. If this is something that we want (and it is something we'll be seeing, actually, in the final exercise), then we're better off
writing two separate kernels, splitting the computation across multiple kernels. That's a way to guarantee some sort of global synchronization: you wait until something is completely finished, then you submit a new kernel.
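Sketched roughly (the queue, sizes and kernel bodies here are illustrative):

    // First kernel: produce intermediate results in global memory.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { /* ... write ... */ });
    q.wait();  // the whole first kernel has finished: a global synchronization point
    // Second kernel: now every work-item can safely consume those results.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { /* ... read ... */ });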
A
So if this work-item here writes to a value in local memory, then this work-item can read it, but we need to be very, very careful that this work-item is only reading it after this one has written to it. So if we're writing, and we want it to be read later, we might write, then do a barrier to make sure that everything has happened, and then we might read with the other work-item.
A
We also have global and constant memory. Global memory is what we get when we do a device malloc; this is the standard, well, global memory.
A
The
standard
we
can
ask
for
constant
memory
using
accessors.
We
need
to
ask
for
local
memory
using
accessories.
We'll
do
this
at
the
end
of
the
the
next.
The
next
lesson:
okay
and
the
cuts
of
memory
is
read
only,
but
we're
not
really
going
to
be
dealing
with
that
at
the
moment.
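Local memory is covered properly in the next lesson, but as a rough sketch of what requesting it through an accessor looks like (sizes illustrative, using the SYCL 2020 local_accessor):

    q.submit([&](sycl::handler& cgh) {
      // one chunk of work-group local memory, visible to the whole group
      sycl::local_accessor<float, 1> scratch{sycl::range<1>{64}, cgh};
      cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{64}},
                       [=](sycl::nd_item<1> it) {
        scratch[it.get_local_id(0)] = 0.0f;  // each work-item writes its own slot
        it.barrier();                        // then the work-group synchronizes
        // ...
      });
    });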
A
Also, let's just say that this work-item writes to global memory and then this work-item wants to read that exact same thing that's been written into global memory. We can't do that safely; there's no way of doing that in a safe manner.
A
So it's better to split this up into two separate kernels, where in the first kernel this work-item writes that value, and then in the second kernel that value is read by some other work-item.
Okay, so a parallel_for. This is a member function of a command group handler, but we're just using a queue in our examples. You define it on a range; this is just a normal range.
A
It's not an nd_range. An nd_range (think nested) corresponds to defining the global range and also the work-group size; in this case we're just interested in the global range, so this is not an nd_range. Into this lambda you're capturing things by value, and you're defining an index argument, and this index is really useful: it essentially tells you the position of the thread within
this range of threads. Okay, and we can see as well that this is a two-dimensional range and, as a result, our ids are two-dimensional as well. So this is some two-dimensional object; we can get the individual dimensions by using just array access, [0] and [1]. Okay, so this is taking a single id, and this can be used to find its position within the iteration space.
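As a rough sketch of that kind of two-dimensional parallel_for (the names in and out and the sizes are illustrative):

    q.parallel_for(sycl::range<2>{64, 64}, [=](sycl::id<2> idx) {
      std::size_t row = idx[0];  // position in dimension 0
      std::size_t col = idx[1];  // position in dimension 1
      out[row * 64 + col] = in[row * 64 + col] * 2.0f;
    });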
A
Okay, so this is a parallel_for taking a range object, and this one is one-dimensional, obviously, so the id is one-dimensional as well. So this is a sycl::id; this is just going to be, sort of, like a tuple, a one-, two- or three-element tuple, which tells you the position of the thread within the space. If you wanted to, you could also get a sycl::item object, and this has a little bit more to it.
A
I would point you to the SYCL spec to look at all the member functions, which are great. Okay, now, in the final one we're using an nd_range, so a nested range: you have your global range, and the entire global range is going to be 1024 by 1024... sorry, no, it's just a single dimension, so it's just 1024,
and then the work-group size is going to be 32; the local range is going to be 32. And then you pass in an nd_item, which is similar to an item. Sorry, I said earlier that with an item you can use a barrier: you can't use a barrier with a normal sycl::item. With an nd_item you can use a barrier, because this means essentially synchronizing within a particular work-group.
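A minimal sketch of that launch: 1024 work-items in total, in work-groups of 32, with an nd_item argument (the kernel body is illustrative):

    q.parallel_for(sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{32}},
                   [=](sycl::nd_item<1> it) {
      std::size_t gid = it.get_global_id(0);  // position in the global range
      std::size_t lid = it.get_local_id(0);   // position within the work-group
      std::size_t grp = it.get_group(0);      // which work-group this item is in
      // it.barrier() is available here because we have work-groups
    });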
A
In the plain-range case we don't necessarily have the concept of work-groups, because we're not defining an nd_range, we're just defining a single range. And we can get lots of nice stuff from this nd_item; again, I'd point you to the SYCL spec. Okay, questions?
A
This is outside the scope of the workshop today. Almost all CUDA features are implemented in DPC++, and the ones that aren't are the newest ones; these are the ones that we're currently working on, so we're quite fast to implement things. Definitely consult the spec; we can post this, maybe, in the Slack.
A
That's defined by the implementation. In SYCL code, if you have work-items reaching a barrier via different branches, this is actually undefined behaviour, but with the DPC++ back end it usually agrees with the CUDA behaviour. Gordon, is that correct? Am I...
C
Yeah, so this is something we're working on, trying to expose a bit better in the CUDA back end at the moment. The SYCL spec currently works under the assumption that, well, it doesn't make any guarantees about the execution of work-items within the work-group; they can make progress in any way they like. So for CUDA, you know, on newer architectures,
they can move with independent forward progress, but when it comes to certain operations like group-based functions, so work-group level or sub-group or warp-level functions, these generally, or often, require convergence or synchronization. And obviously in the CUDA execution model there are a lot of features where you can have individual threads, you know, barriers and copies and things like that, happen for individual threads rather than in warps.
A
Okay, this is just a simple vector add, just using parallel_for, so we don't necessarily need to worry about nd_ranges; just the global range will be fine. Okay, we have two vectors and we want to add them on the device, and we'll be checking the results at the end. Okay, so to compute this in parallel on the SYCL device, we need to construct a queue, allocate memory, copy memory to the device, use a parallel_for to add the two arrays, and transfer the memory back to the host.
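A hedged sketch of that sequence of steps using USM (the names and sizes are illustrative, not the actual exercise code):

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      constexpr std::size_t n = 1024;
      std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

      sycl::queue q;                                      // construct a queue
      float* da = sycl::malloc_device<float>(n, q);       // allocate device memory
      float* db = sycl::malloc_device<float>(n, q);
      float* dc = sycl::malloc_device<float>(n, q);

      q.memcpy(da, a.data(), n * sizeof(float)).wait();   // copy inputs to device
      q.memcpy(db, b.data(), n * sizeof(float)).wait();

      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        dc[i] = da[i] + db[i];                            // add the two arrays
      }).wait();

      q.memcpy(c.data(), dc, n * sizeof(float)).wait();   // copy result back to host
      // ... check the results on the host ...
      sycl::free(da, q);
      sycl::free(db, q);
      sycl::free(dc, q);
    }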
A
We don't need to worry about an nd_range in this case, just a global range. And then it might be worth mentioning as well that a global range would be constructed by saying something like...
A
Any questions, post them in the Slack. Okay, any relation to the warp? So, no, the work-group size is more akin to, say, the block size. It's variable; it doesn't need to be exactly a warp, but usually you want it to be a multiple of a warp, usually say 32, 64, 128, 256,
and so on. Yeah, one important distinction between the CUDA grid and the SYCL nd_range is that (I think I'm correct in saying) with the CUDA grid you're specifying the number of blocks, whereas with the nd_range we don't specify the number of work-groups, we just specify the size.
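As a rough illustration of that distinction (sizes illustrative):

    // CUDA: specify the number of blocks and the block size, e.g.
    //   kernel<<<32, 32>>>(...);              // 32 blocks of 32 threads = 1024
    // SYCL: specify the total (global) size and the work-group (local) size:
    sycl::nd_range<1> ndr{sycl::range<1>{1024},  // total work-items
                          sycl::range<1>{32}};   // work-items per group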
A
It's supposed to be just wait, I think, or calling wait on an event. So with our particular implementation, which maps a single queue to a stream, calling wait on a queue will just wait on that particular stream, which is going to go to a stream synchronize underneath. In the future, when this isn't the case and a queue maps to multiple streams, then you'll need to explicitly list the events and then call wait on those, I think.
A
Or you could build up a dependency graph within your application and then call wait on the last event, or something like that. So you can kind of build these streams theoretically and then call wait on, you know, the last one of those.
A
Is there a way to visualize the graph, a nice tool to visualize the graph? As far as I'm aware, there is not; someone correct me if I'm wrong.
A
The name of the tracer... so we'll be looking at profiling tools in the final section.
A
So it will be there, cool, it'll be there, absolutely. And we'll fly through the next section, but just to echo what Courtney said: we have a PI tracer, so the plugin interface again talks to the back end, the plugin, whatever that might be, so essentially this is showing everything that the plugin interface is doing.
A
The plugin interface is doing a lot of things, so you can see how the SYCL implementation is talking to the back end. These are essentially messages passed to, say, CUDA, and then it gets back a success or whatever.
A
This can be useful if you think you're getting some kind of an error in the way that the plugin interface is interacting with the back end, or if you think that there's an error in the back end somewhere; then you can easily locate it. When I say easily... these things are usually difficult to find.
E
I have a question before we move on. So we used only scalars right now, but is it also possible to use complex objects, structures or classes?
A
These arrays are accessible to the kernel code; you need to make sure that they're either malloc'd on the device, or that you're not just passing pointers to, say, host memory. You need to make sure that these things exist in device memory, yeah, cool. And that's kind of a performance question as well, in terms of whether it's better to organize things as structs of arrays or arrays of structs, and usually the answer is structs of arrays.
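As a rough illustration of the struct-of-arrays versus array-of-structs point (the Particle names are made up for the example):

    struct ParticleAoS { float x, y, z; };       // array of structs: p[i].x, p[i].y, ...
    // neighbouring work-items reading p[i].x touch strided memory
    struct ParticlesSoA { float *x, *y, *z; };   // struct of arrays: x[i], y[i], z[i]
    // neighbouring work-items reading x[i] touch contiguous (coalesced) memory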
A
So we're just trying to get the index; the only index that there is, really, is the global id. Then we're indexing into a and b, adding those and saving the result to our output, then calling wait, then copying back and checking the results.