From YouTube: 2. Enqueuing a SYCL Kernel
Description
Part of An Introduction to Programming with SYCL on Perlmutter and Beyond, March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
A
Enqueuing a SYCL kernel. So, do we still have everyone? Rod, do you think we're okay to go?
B
Yeah, I think so. Yeah, crack on, yeah.
C
Can I just ask a very basic question? This is Lucian. So what is the role of the SYCL targets, since you specify that at compile time? I suppose it builds the kernel for the CUDA backend, but then with the filter you can actually switch what you want. So if I don't specify that one, it means that if I change the SYCL device filter to be GPU, I'm not going to have the kernel built, so I suppose you can't run it.
A
Okay, so using SYCL targets you can specify multiple targets. So if I wanted to, I could compile for every conceivable backend, and I'd be producing device code, or device IR, for every conceivable backend, and then you could choose at runtime. So the compilation and the device selection at runtime are very separate. It's the user's responsibility to make sure that you have the correct device code before a certain device is chosen at runtime.
C
So now, when I got the errors, I was able to run on the host, but I got errors on the device. That means it was an error with whatever GPU it was using?
A
Exactly, exactly, yes. And you can actually see, you say this is Lucian, is it? Yes, yes, sorry. You can see that your error is an error that's being passed by the plugin interface, the PI code, PI CUDA, so this is relating to the backend. I'm not entirely sure if this would change if you're using an sbatch script; maybe it's something to do with the permissions of your account, potentially. I'm not sure.
A
No, yeah, it could be. Or maybe, if you run nvidia-smi or something, you might see that there's a really serious job happening on that node, but I'm not entirely sure.
A
Brilliant, yeah, thank you. Okay, so I'm going to crack on. Okay, so first kernels. Again, SYCL is C++, but with, you know, offloading. Okay, so learning objectives: learn about queues and how to submit work to them. Okay, so someone very, very astutely asked the question: how does this map to a CUDA stream? At the moment, one to one; in the future it's hopefully not going to be one to one, because that will allow for more concurrency. And: learn how to allocate, transfer and free memory using USM.
A
So, the queue. In SYCL, all work is submitted via commands to a queue. The queue has an associated device that any commands enqueued to it will target, okay. So when you construct a queue, it essentially gets some device by some manner: it might be a default device, which you can specify with some device filter, or you could explicitly ask for a GPU.
A
You can also write your own device selectors, which is outside the scope of this, but you could feasibly, you know, choose a device that's only a CUDA device, or one that has a certain string in its name, or something like that. You can define your own ways to select devices. So there are several different ways to construct a queue; as we say, we're going to default-construct one, just because it gives you a lot of flexibility at runtime. This will have the SYCL runtime choose a device for you, and you can, you know, override this using your device filter, as we saw.

A
So, work submitted to a given queue can be executed in any given order. This is important in general when we're dealing with SYCL: the queues are not necessarily in-order, and work can be executed in whatever way the runtime, the scheduler, thinks is optimal. As we mentioned earlier, at the moment, because we're dealing with a queue mapping to a CUDA stream, this doesn't really apply: CUDA streams are in-order, so at the moment queues with the CUDA backend are in-order, but this is liable to change. So it's good to pretend they're out of order, and to explicitly specify that one thing follows another. So it is necessary to define a given task's dependencies: the events that we want to wait on before the next task happens. So we can call wait, saying: okay, don't do anything on this until this has happened; or, when adding a task to a queue, we can say: don't do that until this event has finished, that event has finished.
A
Okay, and SYCL events are returned from tasks, okay. So when you enqueue a task, you get an event, and then you can wait on that: on the event, or on the actual queue itself. So, constructing queues.
A
So here's a default queue; we've actually already gone over this very quickly. And then here's a queue with a GPU selector. So then, as we saw, this will throw a runtime error if no GPUs are available.
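A minimal sketch of the two constructions being described, assuming SYCL 2020 naming (older DPC++ releases spell the selector sycl::gpu_selector{} and use the CL/sycl.hpp header):

    #include <sycl/sycl.hpp>

    int main() {
      // Default-constructed queue: the SYCL runtime picks a device,
      // which can be influenced at run time, e.g. via SYCL_DEVICE_FILTER.
      sycl::queue q;

      // Queue with a GPU selector: construction throws a sycl::exception
      // at run time if no GPU is available.
      try {
        sycl::queue q_gpu{sycl::gpu_selector_v};
      } catch (const sycl::exception &e) {
        // reached when no GPU device can be found
      }
    }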
A
So in SYCL there are two models for managing data: the buffer-accessor model, which we're just going to mention, because we don't actually have time to get into it today, and unified shared memory. So unified shared memory involves explicit allocations and frees, and sometimes memcopies, but not always. The model that you choose can have an effect on how you enqueue kernel functions; we'll see that in a moment.
A
So for now we're going to focus on the USM model. You need to know that the buffer-accessor model is a thing in SYCL, but we're just not going to cover it today; maybe in a future workshop.
A
Okay, so here's a little table; this is from the DPC++ book, which is very good, recommended.
A
So when you malloc on device, the pointer that's returned is not accessible on the host. If you try to dereference it, or access the data within that allocation, on the host, you'll get a segfault or an illegal-access error. It is accessible on device.
A
Okay. If you do a host malloc, it's accessible on the host and also on the device, and it's located on the host. It's not necessarily recommended to use a host malloc for device work; if you wanted to share an allocation between device and host, it's better to use malloc_shared. This is accessible on host and device, and it will use the underlying CUDA API, say cudaMallocManaged, which will allow you to essentially use the same pointers on device and on host, and then it'll migrate the data in between.
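A sketch of the three USM allocation flavours just described, using the SYCL 2020 spellings (counts are in elements for these templated forms):

    sycl::queue q;

    // Device allocation: only dereferenceable inside kernels running on
    // the device associated with q.
    int *dev = sycl::malloc_device<int>(1024, q);

    // Host allocation: accessible on host and device, resides on the host.
    int *host = sycl::malloc_host<int>(1024, q);

    // Shared allocation: accessible on both; on the CUDA backend this maps
    // to cudaMallocManaged, and pages migrate between host and device.
    int *shared = sycl::malloc_shared<int>(1024, q);

    sycl::free(dev, q);
    sycl::free(host, q);
    sycl::free(shared, q);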
A
So, malloc_device. You have two versions: you have a C version, which returns a void*, or you have a templated C++ version; so, depending on your poison, yeah, they do the same thing. You can just pass in a template parameter, maybe a little bit neater if you prefer C++. So again, this is only accessible on the device: any pointer that's returned from this is only accessible on the device.
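For concreteness, a sketch of the two forms (the C-style one takes a size in bytes and returns void*; the templated one takes an element count and returns T*), assuming a queue q and a count n:

    // C-style version.
    void *raw = sycl::malloc_device(n * sizeof(float), q);

    // Templated C++ version; a little neater.
    float *ptr = sycl::malloc_device<float>(n, q);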
A
So both these functions allocate the specified region of memory on the device associated with the specified queue; so you need to pass in a SYCL queue, okay, and it needs to be associated with the device. A queue is implicitly associated with a device and a context, but, most importantly here, a device, and as a result the queue needs to be part of the malloc_device call, because you're specifying which device you want the allocation to happen on.
A
So it's only accessible in a kernel function running on that device; very important. So, kernel code: again, this is the device code, essentially. The only bit in our SYCL code that is going to be run on device is the kernel function, so that's the only place where we can access this.
A
So we get a synchronous exception if the device does not support USM device allocations; we don't need to worry about that today. And it's a blocking operation. That's sort of important: a lot of operations in SYCL are not blocking, but this is blocking. Okay, malloc_shared, yeah. This is convenient; it uses cudaMallocManaged, and then the pointer is accessible from host and device.

A
It is not run asynchronously; it is blocking. It would be equivalent to enqueuing it to the queue and waiting on it immediately, but we haven't gone through waiting yet. All of these malloc_x functions are blocking. So this is convenient: we can make a single allocation and then access the pointer from host and device, and the API will migrate the data back and forth. It's not as performant as doing explicit memcopies, because of the underlying mechanism used to transmit the data: it relies on page faults, essentially. It relies on one device asking for the data, and then the API realising, oh, it's not there yet, now we need to go and get it. Whereas if you tell the API to send things explicitly, then usually, if you're doing a lot of memcopies, it'll be more performant; and yeah, cudaMallocManaged, potentially slower. Okay, free. So this actually should be sycl::free, in the sycl namespace: sycl::free.
A
So, in order to prevent memory leaks, USM device allocations must be freed by calling the free function; this should be sycl::free, in the sycl namespace. If you just use a normal free, which is part of your normal C library, then you might get an error, and I think in fact you will get one in DPC++. Okay. And this is also blocking, and the queue needs to be the same one the allocation was made with; yeah, that maybe goes without saying.
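A minimal sketch of the allocate/free pairing being described; note that the same queue (strictly, the same underlying context) must be passed to sycl::free:

    sycl::queue q;
    int *dev = sycl::malloc_device<int>(1, q);
    // ... use dev in kernels submitted to q ...
    sycl::free(dev, q);   // blocking; plain ::free(dev) would be an error here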
A
Okay, memcopies. So this is important if you're using, say, malloc_device. So let's just say that you allocate some space on a device, and then you also have a vector; you want to copy the elements from the vector over to the device. You need to explicitly copy the memory over; yeah, this is probably straightforward. And the same function is used regardless of which direction you're going in: the destination might be on the host and the source on the device, or vice versa.
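A sketch of the two directions; the same member function is used both ways, with the destination first, then the source, then a size in bytes:

    std::vector<float> host_data(n, 1.0f);
    float *dev = sycl::malloc_device<float>(n, q);

    // Host to device.
    q.memcpy(dev, host_data.data(), n * sizeof(float)).wait();

    // Device to host.
    q.memcpy(host_data.data(), dev, n * sizeof(float)).wait();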
A
Yeah, copying between devices, between queues, is not necessarily allowed unless they share the same context, but that's something that we're not going to cover today. At the moment we just want to think about host to device and device to host. Another important thing, actually, that I didn't mention on the previous ones:
A
I don't know, sorry, that's my... sorry. Okay, copy. So we have this standard vector of dependent events, so we can actually pass in a vector of events that we're waiting on, so that this will not happen until we get the go-ahead from the previous events. Okay, and this returns an event as well, so actually we could take this event and then submit it to the dependent events of the next kernel, and so on.
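A sketch of that chaining, using the overloads that take a std::vector of dependent events (the names here are illustrative):

    sycl::event e1 = q.memcpy(dev, in.data(), n * sizeof(float));

    // This copy will not start until e1 has completed, and it returns an
    // event that could in turn be passed on to the next command.
    sycl::event e2 = q.memcpy(out.data(), dev, n * sizeof(float),
                              std::vector<sycl::event>{e1});
    e2.wait();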
A
This is a neat way of doing things, which we'll see in the exercise that's coming up. Okay, so pretty much all of these queue member functions return an event; I think all of them do. So it's a good idea either to wait on them or to pass them on to subsequent commands as dependent events. Okay, we'll see how this happens in the next exercise.
A
So, memset: this is just setting the bytes in a particular allocation, setting the value for num_bytes. And then fill as well: initialise the data with a recurring pattern.
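A sketch of both; memset works byte-wise, like C's memset, while fill repeats a typed value:

    int *dev = sycl::malloc_device<int>(n, q);

    // Set every byte of the allocation to zero.
    q.memset(dev, 0, n * sizeof(int)).wait();

    // Fill with a recurring typed pattern: every element becomes 42.
    q.fill(dev, 42, n).wait();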
A
We'll see some examples of this in the next few slides. You can also submit a kernel as a parallel_for, and this will be executed over a certain range. We're just going to deal with a simple range for now: this is a one-, two- or three-dimensional object which says, let's have five in the x direction, ten in the y direction and a hundred in the z direction.
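For example, a sketch of a parallel_for over a simple one-dimensional global range; the three-dimensional case from the spoken example would use sycl::range<3>{5, 10, 100}:

    float *dev = sycl::malloc_device<float>(n, q);

    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      dev[i] *= 2.0f;   // one work-item per element
    }).wait();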
A
Okay, so kernels take the form of function objects or lambdas. Lambdas, as we say, are used a lot in SYCL; quite convenient, if you ask me. The queue provides member functions which allow you to invoke a single task or a parallel_for. Okay.
A
Later on in the workshop we'll see that there are other ways of enqueuing a parallel_for or a single task, but these are maybe the most straightforward, shortcut ways. And these can only be used when using the USM data management model; yeah, that's correct. We don't necessarily need to worry about that at the moment.
A
So, here's a basic SYCL application which uses shared USM and invokes a kernel function with a single_task. So, shared USM. Okay, this is blocking, obviously. It's blocking, and another reason why it's blocking is that it needs to return something that is not an event: there's no way that malloc_shared can return an event, therefore it needs to be blocking. I think that's the rationale. So we're allocating space for one element of type T, associating it with this particular queue, and then we're initialising it on the host.
A
Then this is our kernel code: dereferencing the exact same pointer in the kernel code, and then we're just going to square it, and that's fine, and then return it, okay. So this is totally fine. We need to wait on the event that's returned from the single_task, but then we can just return the value; it'll automatically get the value back from the device.
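A reconstruction of the pattern being walked through, as a self-contained sketch (the function shape, allocate shared, initialise on host, square in a single_task, read back, is assumed from the description):

    template <typename T>
    T square_shared(sycl::queue &q, T x) {
      // Shared allocation: the same pointer is valid on host and device.
      T *data = sycl::malloc_shared<T>(1, q);
      *data = x;                    // initialise on the host

      q.single_task([=] {           // kernel code, runs on the device
        *data = *data * *data;
      }).wait();                    // wait on the returned event

      T result = *data;             // read back on the host
      sycl::free(data, q);
      return result;
    }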
A
Okay, so we allocate USM device memory by calling malloc_device. This is a little bit more involved: instead of just calling malloc_shared and letting the API do all the work with pointers for you, we're going to explicitly malloc on device, which means that we need some memcopies as well. Okay: any time we want to do anything on the device, we're memcopying to the device pointer whatever's in x, just sizeof(T); we're going to square it, and then you need to memcpy it back. Okay. Actually, I'm not sure if anyone is astute enough to notice something that might not necessarily go right with this.
A
Exactly, exactly, yes, well done! Yes, yes, well done, yeah, brilliant. So essentially a queue can be out of order, okay; there's no saying that this will happen after that, and that will happen after that. We need to actually define the dependencies. Okay, so we'll go on to that next. Yeah, well done!
A
If we call wait, then it'll happen kind of in order; you know, we will wait until each has finished. Okay. It's a little bit more elegant if we can use explicit dependencies from events, because it means that we don't have to have a linear, completely one-dimensional DAG.
A
You can have a complex DAG, okay. Let's see something. Okay: so with just wait, you're forced to wait one after the other, whereas if you explicitly name your dependencies using events, and some vector of dependencies, then you can essentially have an arbitrarily complex DAG. Okay, and this will, you know, make a lot of difference in terms of writing performant code; concurrency is obviously very important, so yeah, we need to do this. So then this would depend on event one.

A
This would also depend on event one; they would have no dependency on each other, so they can happen synchronously or not... sorry, they can happen concurrently; and then this depends on both of them.
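A sketch of that diamond-shaped DAG; the pointer names are illustrative:

    // e1: initialise the input.
    sycl::event e1 = q.memcpy(dev_in, host_in, n * sizeof(int));

    // e2 and e3 both depend only on e1, so they may run concurrently.
    sycl::event e2 = q.parallel_for(sycl::range<1>{n}, e1,
        [=](sycl::id<1> i) { dev_a[i] = dev_in[i] * 2; });
    sycl::event e3 = q.parallel_for(sycl::range<1>{n}, e1,
        [=](sycl::id<1> i) { dev_b[i] = dev_in[i] + 1; });

    // The final task depends on both e2 and e3.
    q.parallel_for(sycl::range<1>{n}, std::vector<sycl::event>{e2, e3},
        [=](sycl::id<1> i) { dev_out[i] = dev_a[i] + dev_b[i]; }).wait();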
A
Yeah, kernel function rules. They must be defined using a C++ lambda or a function object; they cannot be a function pointer or a std::function. So this is, as I said earlier:
A
you need to use a C++ lambda or a function object. I would personally recommend lambdas, but it's a matter of taste. They must always capture or store members by value; this is very important. So, when you're defining your single task, you need to capture by value, okay. You can't pass things by reference into a kernel, because, well, certainly with, say, malloc_shared or whatever, you might be dereferencing things in the wrong way. You want to pass them by value, and that will adjust them when they're submitted so that they're appropriate to be run on the device.
A
Yes, so you can name your lambda if you want; you don't actually have to. This is a DPC++ extension: you used to need to name your lambda, which we'll go through later; it can be really useful when you're profiling, but you don't have to any more. So, SYCL kernel function names:
A
they need to be unique as well, but we don't need to think about naming them at the moment. So, SYCL kernel function restrictions: no dynamic allocation, no dynamic polymorphism, no function pointers, no recursion, okay; these are sort of set in stone. Kernels as function objects, okay. So this is with a lambda, okay, just some lambda which is being passed, we see, by value; but we can also use a function object, okay, just the same. It's okay.
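A sketch of the same kernel written both ways; the functor's members play the role of the lambda's by-value captures:

    // As a lambda: dev is captured by value (the pointer itself is copied).
    q.single_task([=] { *dev += 1; }).wait();

    // As a function object: state is stored by value in members.
    struct Increment {
      int *dev;
      void operator()() const { *dev += 1; }
    };
    q.single_task(Increment{dev}).wait();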
A
Can a SYCL kernel in the form of a function object return something, or does it have to be void, like CUDA kernels? That's a good question; so, yeah, Gordon?
F
That's right, yeah. I'm not entirely sure if implementations enforce it, but it's expected that the kernel functions are void. If you want to return anything from a kernel, it has to be done through an accessor or a USM pointer; there's no return type, yeah.
A
You can definitely try it out. There's no reason why you'd necessarily want to: it's not as if a kernel, or a function object that you pass into a kernel, is something you'd want to use for another purpose as well, one that needs it to return something other than void. But yeah, you can try it out; in general you have void.
A
Yes, yes, yes, exactly! So this is: we're using single_task at the moment, but the other one, for parallelism, is parallel_for. So, for this particular exercise,
A
we're just going to be dealing with single tasks, just so we can get our heads around submitting things, getting events and so on. But a parallel_for can be defined on a range, and this is just a simple global range; we'll see later on that there are more complex ways to construct a range, which, you know, is similar to the idea of, say, threads and blocks and grids in CUDA.
A
But this is just a naive global range saying how many work-items we want; we're not worrying about, you know, work-group size, block size, we're not worrying about that in this particular instance, but we'll get on to that later.
G
Yeah, arrays: as arguments, or as captures?
A
One more time? Arrays, oh, can we pass them as arguments? Yes, certainly, yeah. So it's easy to pass them by captures, but yeah, you can pass arrays as arguments as well. Yeah.
A
These kernel submissions are not going to be altering, say, the underlying pointer; they're just going to be acting on the data at that point, so you don't need to pass things... well, actually, sorry, I should say: things need to be passed by value.
A
So if you're passing, say, a pointer as an argument, you need to make sure that you're not passing it by reference, because we don't want the possibility that it might be altered by the kernel. That might throw an error; I'm not sure what the compiler would say if you tried to do that, but yeah, always by value in general, and it's easy to do that with just a value capture. Yeah, capture is very easy.
A
A vector, a std::vector? Okay, so you don't necessarily want to pass a std::vector into your kernel: the std::vector lives in host memory, so there's no way of accessing that from the device, from a kernel. You know, allocations that are on the host: a std::vector is a host allocation, even though it doesn't necessarily call itself that. You need to make sure that anything that's being used on the device has a device allocation or a shared allocation. Using the buffer-accessor model there's quite a neat way in which vectors create buffers, which are then accessed with accessors, but we're not covering that today.
F
Yeah, I think generally the problem with using std::vector is that it can dynamically reallocate the memory, which, obviously, done in a kernel, is going to be problematic. std::array can be used; it's used in quite a few places internally. But generally you wouldn't want to use std::vector.
A
We might start looking at the next exercise.
B
What time do you want to restart? Personally, I think we were scheduled to restart a little while ago, but we can shift things forward a bit.
E
Yeah, up to you, really.
B
I guess, let's give people, say, 25 minutes to do that exercise and then have a bit of a break, and then that takes us to, brilliant, ten past... ten minutes past 11, Pacific.
A
Okay, okay, so let's have a look at this. Okay, so this is our latest task. Okay, so essentially we want to... let's look at the README as well.
A
Okay, so, instructions: we want to allocate two ints on device, where a is one and b is two, okay. So we need to memcpy to initialise the device memory for a_dev and b_dev. So yeah: now we want to use a single_task to multiply a_dev by two, use a separate single_task to multiply b_dev by 100, then use another single_task to add the results of both together and store the value in a. We then copy the value back to host and print it to standard out. Okay.
A
That would be nice; not essential, but this is a nice use of a very, very simple DAG. So then the dependency will be on this for both of these two, and then this has a dependency on both of these, and then the memcpy will have a dependency on that.
A
Allocate device memory, so we can use malloc_device, or maybe malloc_shared if you like; memcpy; and then free memory, single_task, and so on.
A
No, no, sorry: the property list, the default property list, is empty, yeah, so you don't need to worry about specifying a property list. These are kind of there just in case, at some stage, it becomes a good idea to implement property lists for these things, but I actually don't think that there are any defined properties that you could pass into malloc_device. So, correct me if I'm wrong, Gordon.
F
Yeah, I don't think there are any properties available you can use at the moment. Generally, most SYCL classes can be constructed with a property list, but in a lot of places it's there, sort of, for the future.
G
It has to be int*, even though it's just a scalar, just a single integer, because it's a pointer, a device pointer.
H
I have kind of a general question, if that's all right, about SYCL. It seems like it really relies on building a graph with the right dependencies for each kernel in the queue, and that seems like an easy thing to mess up: just forget to add one dependency and you're left with some very difficult-to-debug race condition much later on down the line. Are there any strategies to help do this correctly?
A
Definitely: if you're just trying to get code working, just use waits. If you wait, then this will enforce this linear, one-dimensional chain of execution; that's an easy thing to do. Whereas if you try to do the more complex things, maybe this is a little bit more subtle, more nuanced, but it can give better performance, theoretically at least. Well, yeah, theoretically; but in general, if you're trying to remove elements of asynchrony, just call wait, because that will, you know, just wait on whatever it is, so that in effect things will be happening sort of synchronously, you might say.
H
Yeah, that makes sense, thanks. And I guess, relatedly: do you find that, you know, medium-complexity scientific SYCL applications do end up with very complicated branching graphs, or do you find that more of the computation is embedded in the kernel, such that you do have a pretty simple flow?
A
Well, personally I've been working on some deep neural network libraries recently, which is what I'd point to, and yeah, definitely there is an element of concurrency there which, you know, would involve this kind of branching DAG. But it's not necessarily the case; it's really implementation-dependent, or really application-dependent. People at the labs can maybe answer this question better than I can.
F
From my experience, I think generally where you see DAGs like this is when you're doing sort of copying data whilst doing compute at the same time, like double buffering, things like that; or, you know, using multiple devices and doing load balancing. That's where more complicated DAGs tend to come up, or if you're sort of doing interop between SYCL kernels and something else.
I
Maybe to go back to the dependency chain: I know you will not talk about buffers and stuff, but one of the big advantages of buffers is handling all this data dependency for you automatically, right? And I think it is one of the nice advantages: in theory, the runtime can be smart enough to do all this kind of interleaving and just put the correct dependencies in automatically for you. I think it's a really good thing to use, but porting your code to use buffers is more involved, indeed, yeah.
A
Absolutely, yes; I should have mentioned that. So the other memory paradigm, the other memory model, the buffer-accessor model: it pretty much does all this stuff behind the scenes for you, so you don't need to worry about it at all. But this kind of explicit dependency naming is maybe more akin to other approaches, yeah.
G
I think you mentioned SYCL wait, so wait is in the sycl namespace; but why is slide 24 saying it's a method as well, of a single task?
A
So this queue submission returns an event, and then you can call wait on an event, which, yeah, means that nothing else will happen until this has returned.
A
This, this q.single_task... oh yeah, nice, yeah. Okay, so actually this is something that's maybe related to what you're saying, but let's just say that we wait and we want to assign that value to something: wait actually doesn't return an event, okay.
D
So that's a, that's a nice... that's a nice point.
I
Yes; maybe the difference between the two is the granularity, right? Where queue wait waits for all the commands that you enqueued into the queue to finish, whereas if you wait on an event, you wait only for this event to finish, right? So there is a little difference between the two. So if you are in an in-order queue, like CUDA, both are totally equivalent, right; but if you are more in the out-of-order way, they are totally different. It's not the same granularity.
C
So what happens if you pick up the event with a variable and then you call wait on it? Where is the wait actually acting? It's on the CPU, on the host?
A
Essentially, I'm assuming that it's an interaction with the plugin interface, which interfaces with, say, the CUDA driver in this instance, and it's saying: okay, on the host, let's wait until we get the plugin interface saying the kernel completed successfully. So yeah, we can wait for an event, which would be, you know, an underlying CUDA event, or we can wait for the queue; and if we're waiting for the queue, that's waiting for essentially everything in the queue to complete.
C
If you want to execute multiple copies, for example, and the order doesn't matter, you could potentially just launch a bunch of memcopies, and it doesn't matter; you don't need to wait for them individually.
A
Exactly; and in fact you don't need to worry about the individual tasks, right, because the queue has a record, it kind of has a hidden record, of all the tasks, so you can just wait until all the events have completed.
C
And if you want to pin the memory on the device, how do you achieve that? Is it when you actually create the queue, I guess, or the memory, when you actually create the memory with malloc_device, I guess? When you say memory: there are multiple ways in which you can allocate memory on the device, right?
F
I can probably answer this. So at the moment the SYCL standard doesn't have an explicit way to do pinned memory, and because of that it can kind of vary from one backend, one particular device, to another. But generally, if you allow the SYCL runtime to allocate the memory for you, through, like, malloc_device, and as long as, sort of, the size of memory you're allocating is sort of along the lines of what the platform would recommend, in terms of, you know, the size of the memory, the cache line, like multiples of cache lines and that kind of thing, then it should allocate it in pinned memory for you. So it's kind of a quality-of-implementation detail. I think there has been some interest in a way, kind of properties, to be able to explicitly request that allocations are pinned, and that's something that we may see in the future. But at the moment there's not an explicit way to do it; it's more implementation-defined.
I
And maybe one general comment: because you produce just CUDA code at the end, you can just use nvprof or whatever tracer you like, and you can verify how they map, right? So this is also the good thing with all these offloading models, or something like that: you can always check what the backend is doing. So at the end, nothing is magic, and you can check if indeed they pin memory, for example.
C
What happens if you run this one, let's say, on a KNL? So you might have a CPU, you might run on KNL, so then you also have different types of memory, right? How do you control which... or even on the GPU, right, like if you are using texture memory?
A
So, using buffers and accessors: you don't have the same control, when you're just dealing with malloc_device and that kind of thing, as to where your memory actually is, what kind of memory you're using. For that, the buffer-accessor model is better. We're going to be looking at using CUDA shared memory in, I think, if not the next section then the one after that, so you'll kind of see how this is done; but it's using, yeah, buffers and accessors... well, not necessarily buffers, but accessors.
C
And if you are going to run this code on the host after that, essentially it's going to skip, I guess, this step of copying the memory? Or what does it do? Does it do a local copy to the memory, or...?
F
So I believe with USM, because it's explicit, it will still perform the copy, even if it's strictly unnecessary; and with the buffer-accessor model it's a bit more forgiving. With buffers and accessors, rather than sort of explicitly, kind of prescriptively, saying what you want to be allocated and copied, when and where, you're kind of describing the requirements in terms of what memory you want where and when, and then the runtime kind of does the efficient thing for you.
A
I think we might start the next section; maybe I'll go through the example very quickly.
A
Okay, so, yeah: essentially, here we have a and b; construct a queue; allocate memory on the device, okay, just size one; memcpy to both, okay; we're getting the return values, the events, from both, okay. Not necessarily... essentially, we could also just do something like, you know, a general q.wait() here, but yeah, we've done this. Okay: this has a dependency on e1; this single_task has a dependency on e2, okay. We're also getting the events that are returned from these, and then we have another single_task which has both of the individual single_tasks' events as its dependencies, and then we're just going to add the two together.
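A reconstruction of that walkthrough as a sketch; the names a_dev, b_dev, e1 and so on are assumed from the spoken description:

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
      sycl::queue q;
      int a = 1, b = 2;

      int *a_dev = sycl::malloc_device<int>(1, q);
      int *b_dev = sycl::malloc_device<int>(1, q);

      sycl::event e1 = q.memcpy(a_dev, &a, sizeof(int));
      sycl::event e2 = q.memcpy(b_dev, &b, sizeof(int));

      // Independent single_tasks: each depends only on its own copy-in.
      sycl::event e3 = q.single_task(e1, [=] { *a_dev *= 2; });
      sycl::event e4 = q.single_task(e2, [=] { *b_dev *= 100; });

      // The sum depends on both; the copy-back depends on the sum.
      sycl::event e5 = q.single_task(std::vector<sycl::event>{e3, e4},
                                     [=] { *a_dev += *b_dev; });
      q.memcpy(&a, a_dev, sizeof(int), e5).wait();

      std::cout << a << std::endl;   // 1*2 + 2*100 = 202

      sycl::free(a_dev, q);
      sycl::free(b_dev, q);
      return 0;
    }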