From YouTube: 1. Introduction to SYCL
Description
Part of An Introduction to Programming with SYCL on Perlmutter and Beyond, March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
A
So welcome, everyone, to An Introduction to Programming with SYCL on Perlmutter and Beyond. We're really excited to offer this training event and to host Codeplay, who are real experts on SYCL, to help us learn more about this programming model. I'm Brandon Cook; I work at NERSC in the Application Performance Group. Amanda Dufek is also on the same team, and Helen is here from our User Engagement Group. And then from Codeplay we have Hugh, who's going to lead the training from their side, and Gordon Brown and Rod Burns are going to be here to assist as well.
A
So, some basic logistics. Make sure that you are muted when you think you are — I think you might be muted by default when you join. So if you are attempting to speak and no one's responding, please check that you're off mute.
A
Also, we ask that you please rename yourself in your Zoom session; instructions are here on the screen. This will allow us to streamline helping you out if any kind of support question comes up during the course of the training event.
A
As you've probably seen from the pop-up, live transcription and full view of the transcript are enabled, and we are recording this. So if you do not wish to be recorded, this is your notification, or reminder, to keep your video and audio off.
A
If you just want to listen, that's fine. Please ask your questions in the general channel in Slack; this is preferred over the Zoom chat. Here's the link to join that channel, and this link should also have been provided to you via your email invite for this event.
A
And, as I mentioned, the slides and videos from this will be made available after the event's over, so if you want to reference this content later, it's going to be available. There will be a hands-on component.
A
You can find it in this repository on GitHub.
A
And finally, I invite you all to please answer the following survey, which is linked here; I'll put the links into the chat momentarily, after I'm done sharing my screen. Answering this survey really helps us improve these events for everyone.
A
Okay, so, using Perlmutter. If you're already a NERSC user, you will have been added to the ntrain1 project. If you have a training account, those expire on March 8th.
A
I'll also call out that, due to the limited number of nodes available, please prefer to use the batch system, as opposed to allocating nodes interactively, within the reservation. And if you're on a training account, you can also access the regular Perlmutter nodes with the same account.
A
And if you go to docs.nersc.gov/systems/perlmutter, this is the landing page to start from for any Perlmutter-specific documentation.
A
I think we're going to cover this again in more detail — the slides have been posted in the chat, and we don't expect you to copy these commands down — but Codeplay has prepared a module for us to use with the compiler, and we'll get into actually using it in the hands-on portion.
A
But if you do have a need to mix in MPI, I'd invite you to please send me a message via Slack, and I'll see if I can figure out a way to accommodate that for your application.
A
Okay, so, finally, the schedule. We're going to start with the introduction, and then discuss how to actually get some work scheduled on the GPUs. We'll have a few short breaks, then some discussion on profiling and debugging, and then an open question-and-answer session, followed by the end of the event. And so, with that, thanks everyone for attending; I'll stop sharing and hand it over to you.
B
Yeah, we're going to be looking at some basic features of the language, hopefully covering as much ground as possible in this short window. We have lots of materials available online, so this will be a jumping-off point into the language, but also into using the language in a serious way. So, without further ado — and if anyone has questions, please interrupt me at any time. You can use Slack, you can—
B
You can use the chat in Zoom, but I think the most reliable way is maybe just to interrupt me. So please feel free to interrupt me, because everyone has questions all the time, and if we ask these questions, then everyone gains from it. So definitely, please ask some questions.
B
Okay, so let's get stuck in. What is SYCL? This introductory chapter will give us an introduction to SYCL, and also to the compiler that we're going to be using on Perlmutter. The learning objectives for this module: learn about the SYCL spec and its implementations, learn about the components of a SYCL implementation, learn about how a SYCL source file is compiled, and learn where to find useful resources.
B
We're going to be glossing over details about implementations in general, and focusing more on DPC++, which is the Intel oneAPI implementation, especially with its CUDA backend. So: SYCL is a single-source, high-level, standard C++ programming model that can target a range of heterogeneous platforms.
B
I wouldn't try to take all of this in at once. Okay, so, a first example of SYCL code — this is just what SYCL looks like in the wild. You can see — I'm not sure if people can see my mouse?

B
Yes? Good, okay. So, essentially, we're constructing a queue. This is associated with some device; it's a kind of unit of work, like a list of things to be done. We'll go through this in more detail later, so maybe don't try too hard to remember it. We malloc some memory, then we do something — sorry, we initialize it.
B
Then we have some kernel code. This parallel_for is kernel code, and it sits within your normal C++ file — that's a core aspect of SYCL, as we'll see. SYCL extends C++ in two key ways. First, device discovery and information: we can find out what devices are available and which devices we can choose.
B
There are heuristics for prioritizing this device over that device. Second, device control: dispatching work to a particular device, and so on. And SYCL is modern C++ — it's essentially built on modern C++ features like templates and lambdas, so if you like templates and lambdas and you use them a lot, you'll like SYCL. SYCL is open source, multi-vendor, multi-architecture.
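The queue / malloc / kernel flow described above can be sketched roughly like this — a minimal unified-shared-memory example, assuming a SYCL 2020 compiler such as DPC++; the kernel body here is illustrative, not the exact code on the slide:

```cpp
#include <sycl/sycl.hpp>  // some older DPC++ releases ship this as <CL/sycl.hpp>
#include <cstdio>

int main() {
  sycl::queue q;  // construct a queue, associated with some device

  const size_t n = 1024;
  // malloc some memory: USM visible to both host and device
  float *data = sycl::malloc_shared<float>(n, q);
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // initialize on the host

  // kernel code lives inside the normal C++ file: a parallel_for over n items
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] *= 2.0f;
  }).wait();

  std::printf("data[0] = %f\n", data[0]);
  sycl::free(data, q);
}
```

This only compiles with a SYCL-aware toolchain such as the DPC++ module prepared for the workshop; it is meant as a reading aid, not as the exercise code.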
B
SYCL is a single-source, high-level, standard C++ programming model that can target a range of heterogeneous platforms — with emphasis on single source: within the same file we have the host code, the CPU code, and the device code, the code that we want to target at, say, a GPU or some other offload device.
B
So we have parallel host-compiler and device-compiler passes, which are then linked together to make a bundled executable; we'll go through this a few more times. The main idea is that there are two separate compilers: your device compiler and your host compiler. There are many kinds of compilation processes, so we don't need to get too bogged down in that; in a second I'll talk about the CUDA-specific compilation model for DPC++ at a high level.
B
Okay, so SYCL is a high-level language. It's based on modern C++, and it doesn't add any language features that aren't already in the language. It provides high-level abstractions over common boilerplate code, which is a great thing if you're used to dealing with OpenCL or other more low-level APIs.
B
It gives us high-level abstractions over common boilerplate for device selection, platform selection, kernel function compilation, and dependency management and scheduling. This is quite natural in SYCL — the API really allows us to do it quite elegantly, and in my opinion it's one of the great things about using SYCL. And it's standard C++: SYCL doesn't add any language features that aren't already in the language.
B
Features are implemented in the backend in particular ways by an implementation, but they're just normal language features: you use things like lambdas and templates to write kernel code, unlike CUDA or OpenCL, and there are no pragmas like in OpenMP, that kind of thing. And we can target a range of heterogeneous platforms — this is another really great thing about SYCL.
B
We can take the exact same code and run it on as many backends as are supported, essentially — unless we're doing very, very specific things, particular to some hardware, maybe CUDA-specific. In theory we can target CPUs, GPUs, APUs, accelerators, FPGAs, DSPs — loads and loads of things. That's a really good thing about SYCL: the interchangeability of what we're actually offloading to.
B
So, the SYCL spec. The first version of the spec was SYCL 1.2; we're currently on SYCL 2020.
B
The spec has been defined, and the main implementations have almost fully implemented SYCL 2020. None of them have completely finished implementing the spec, but they're pretty close to done. In terms of daily use, you wouldn't know that the entire spec isn't implemented; it's very workable and useful.
B
Here's an overview of some implementations. We're going to be focusing on oneAPI DPC++, particularly with the CUDA backend. Codeplay also has its own SYCL compiler, called ComputeCpp, but we're going to be focusing on this one.
B
Okay, so what a SYCL implementation looks like. The SYCL interface is a C++ template library that developers can use to access the features of SYCL — this box here (that box shouldn't be highlighted; this one here). The language is used from C++ through templates, through standard C++ features, so it looks like normal code, like you're using a normal library. And the same interface is used for both the host and the device code.
B
Yes, this is important: it's all C++. The host is generally the CPU, and it is used to dispatch the parallel execution of kernels. The host is your standard serial CPU execution model, I suppose.
B
That's a bit of a random way of saying it, but yes. The device, then, is your accelerator, your offload processor — a GPU, an FPGA, whatever that might be. And the runtime library schedules and executes work.
B
Okay, so this library here: it loads kernels and dispatches them to whatever offload device you're using; the runtime schedules which kernels should be dispatched at which time, and so on; and it tracks dependencies between kernels, and between kernels and operations like memcopies, that kind of thing.
B
Okay, so the host device. This was something that was in the SYCL spec for SYCL 1.2; it's no longer actually in the spec, but it's still there for DPC++, at least. We can decide to run our code using the host device, which is a CPU device, so you can interchange these devices, which is great for debugging — this is really, really useful. But it's provided by the implementation, or not; it's not necessarily a core language feature anymore.
B
So it's a great thing for debugging, but once your code is working on the host, that doesn't necessarily mean it's going to work on an offload device; you sometimes need to also debug on the device in question.
B
Okay, and then the backend interface. This is where the SYCL runtime calls down into a particular backend in order to execute on a particular device. For DPC++, this is called the plugin interface, and it talks to the CUDA driver: DPC++ uses the CUDA driver API, and the plugin interface builds kernels, sends them off, awaits the responses, and that kind of thing.
B
In the case of the CUDA backend, it generates PTX, which is the CUDA assembly, and it also generates a CUDA binary, which is put into the final executable. We'll cover this again later.
B
Okay, so the standard C++ compilation model, when you compile your normal code — this is glossing over a few things, but in general: you take your C++ source and compile it into an object; then you link it, possibly with multiple other objects or maybe static libraries, into an executable; and then you give it to the CPU.
B
Whatever runs it. Okay, so the question is: how do we do this for both the CPU and an offload device?
B
Usually, these kernels are function objects — the kernels are function objects or lambda expressions. By the way, this is phrased maybe unfortunately, but it's not a std::function; forget std::function. It's a function object or a lambda expression; it can't be a std::function.
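The point about kernels being function objects or lambdas — but never std::function — can be seen in plain standard C++. SYCL's parallel_for is a template over the callable's concrete type, schematically like this (a simplified stand-in, not the real SYCL API):

```cpp
// Schematic stand-in for a SYCL-style entry point: a template over the
// callable's concrete type, which is what lets a device compiler see the
// kernel body at compile time. A std::function would type-erase it.
template <typename KernelFunc>
int invoke_kernel(KernelFunc f, int i) {
  return f(i);
}

// A function object (functor) playing the role of a kernel.
struct Doubler {
  int operator()(int i) const { return 2 * i; }
};
```

Both a functor instance and a lambda can be handed to `invoke_kernel`, because each has its own concrete type that the template captures.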
B
Okay, so let's see how this works. The SYCL device compiler produces device IR, which in the case of CUDA is PTX. Actually, the default is SPIR-V — a device IR that can be consumed by OpenCL devices — but when we're dealing with CUDA, as we are today, it produces PTX, if we tell it to. We need to tell it to, as we'll see.
B
Then this device IR is linked with the CPU object, so you have the device IR embedded within the executable binary; you can dispatch that, and it'll be split up and run at run time. The idea is that you have these kind of independent compilation streams, and it's only when they're linked that you need to bring them back together. So, now: DPC++ with CUDA.
B
I'm not sure if everyone has had the chance to load the module on Perlmutter, or whether you have access, but the first thing we can do is just check that our install is working. So I'm going to go over to my Perlmutter tab.
B
Here you can see — I'm not sure if that's big enough — sycl-ls. This is the first thing we should use when we're checking what devices are available, and whether the DPC++ installation is working. Actually, sorry—
B
Oh, sorry — these are the commands that we need to run in order to load the module.
B
These are the devices that we can choose from. You can see there are a few different entries — it's like a device triple, I suppose: we have ext_oneapi_cuda, gpu, and then 0, as in the first device of that particular backend, for the first two entries. And you have host, which is the host device — the first host; some SYCL implementations have multiple hosts, but yeah.
B
We don't need to worry about that. Okay, so, to use the DPC++ compiler: by default, using just -fsycl, we compile device code to SPIR-V. We're not necessarily interested in that at the moment; we're interested in compiling for the CUDA backend, which will generate PTX and also a CUDA binary. So if we do this — let me copy that, very nicely — let's see, and then I'll just do test.
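The compile step being demonstrated looks roughly like this. The flag spellings follow the DPC++ CUDA-backend documentation of that era and may differ slightly from whatever the Perlmutter module wraps:

```shell
# Default: -fsycl alone compiles the device code to SPIR-V
clang++ -fsycl test.cpp -o test

# Target the CUDA backend instead, producing PTX plus a CUDA binary
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda test.cpp -o test
```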
B
As I said earlier, compiling is okay, but running code should be done using sbatch — though I'm not sure if I'm allowed to do that, if I'm above the law here. Maybe. Yes, so we're running on this device. Okay. And actually, a really neat thing that we can do to change the device that's selected by the runtime is to use SYCL_DEVICE_FILTER.
B
That's great. Okay, let's see what it looks like if we compile and forget this flag, so that we compile for SPIR-V, which cannot be consumed by our CUDA device — let's just see what kind of error we get. Okay. So we know that, actually, when you run this without specifying which device should be chosen, it seems that the default is the CUDA device.
B
If I want it to be sm_80, I need to specify it. Okay, so, what's happening under the hood? Essentially, it's the same thing: you have these parallel compilation streams — this is obviously DPC++ for CUDA. You have your CPU object, which gets produced by your host compiler, and then the device compiler produces PTX assembly. And, for ahead-of-time compilation—
B
—the PTX assembler is also invoked, to create a CUDA object file. Then the CPU object, the PTX assembly, and the CUDA object are lumped into this final fat binary. And this is great: essentially, when we're running our fat binary, the runtime — or at least the CUDA driver — asks: is this PTX compatible—
B
Sorry — is this CUDA object compatible with my compute capability? And if not, the PTX assembly will be JIT-compiled into an appropriate binary at runtime. This essentially means we don't really need to worry about the arch flags that much. So, what was happening earlier, when I was running without having specified the arch:
B
There was a device binary passed along; the CUDA driver took it and said: oh, actually, I can't use this — but it doesn't matter, because I have the PTX as well. I'm going to compile that for sm_80, which is what's needed for the A100, and then I can still run the code. And the PTX JIT compiler uses a cache as well.
B
So the first time that you run it, the JIT will happen; but the second time, third time, whatever, you'll just use the binary that was cached by the PTX JIT compiler.
B
The JIT is given a finite amount of time to compile, because essentially you want your code to start running quickly, whereas the offline PTX assembler has, theoretically, an unbounded amount of time that it can use to do various optimizations and whatever. So it might happen that offline compilation — which is guaranteed by using the correct arch flags — gives you slightly more optimal code out of the PTX assembler. But these differences are very marginal, so you'd need to profile.
B
You
need
to
test
a
little
bit
if
you
wanted
to
test.
If
you
wanted
to
test
the
ptxj
compiler
versus
offline
ptx
assembler,
you
would
just
pass
the
correct
arch
flags
for
offline
compilation
or
the
incorrect
arch
flags
for
jit
compilation,
which
is
a
bit
a
bit
kind
of
hacky,
but
that's
that's
how
it
works:
yeah,
okay,
so
yeah.
B
There
are
a
lot
of
things
going
on
here
like
this
is
kind
of
an
abstraction
as
well
we're
just
focusing
on
a
few
elements,
but
you
can
query
exactly
what's
happening
under
the
hood
by
passing
the
hash
flag.
So
let's
have
a
look
at
that,
so
if
I
just
do
climb,
let's
go.
B
Certainly in our work, when we're trying to figure out what's happening, or what's going wrong, in various compilation processes, we use this all the time. For day-to-day use, if you're just implementing things in SYCL, it's maybe not that useful, but it's good to know that it's there. For instance, if you look at the PTX assembler — this is where the PTX assembler is invoked — we can see that it's for sm_50; okay, because I didn't specify a GPU arch.
B
Okay, another really useful thing is getting the intermediate files from the compilation process, which can be really handy, especially if you're trying to, say, compare the PTX generated by DPC++ against the PTX generated by nvcc. This is really neat. The way to do it is to use -save-temps, and it needs to be called from within an empty directory. So let's have a look at that.
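The intermediate-file trick, as a sketch — run from an empty directory, as he notes; the exact file names produced vary by compiler version:

```shell
mkdir tmp && cd tmp
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -save-temps ../test.cpp
ls   # preprocessed sources, LLVM bitcode, host assembly, device PTX (.s), fatbin, a.out
```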
B
Okay — I think what's at the top there was from previously. Okay, so I'll just make a temp directory.
B
Yes, you need to make an empty directory and cd into it, and then, essentially, I'll set it up again. Okay, so we're going to save-temps.
B
Okay, so we can see — yeah, we got our a.out, and we also got all of our intermediate files. For the host code, we got our preprocessed file and our LLVM bitcode — bitcode or bytecode, I always mix those up — and we also have our x86 assembly.
B
And that's just for the host code — we have all of this for the host code: footers, headers, and so on, fat bins. Very nice. And then for the CUDA backend, we have our CUDA object and our CUDA bitcode. So we can see that the PTX has actually passed through this LLVM layer beforehand: LLVM optimizations are done on the device code as well, before it's actually turned into PTX, and then the PTX gets optimized again.
B
So, theoretically, it's being optimized by two separate things, so theoretically it might be more optimal — who knows. We're also interested in our PTX, which is usually in a .s file, so let's open this up and see what's there.
B
Now, it's not every day that you'd necessarily be using this. Okay, so here's our PTX target. It can be really useful if you're trying to benchmark things — compare performance between, say, CUDA and DPC++. For normal use it may be overkill, but it's good to know that you have these things as well.
B
If you think there are bugs due to a particular process in the backend, you can take your PTX file, pass it to the PTX assembler manually, and figure out whether something is working or not working, etc. Okay, so, specifying the device at runtime: this is just using our SYCL device filter, as we saw. And that's everything for this slide — questions?
H
Can you repeat the flag for requesting a specific CUDA architecture?
B
Yes, of course. Let's just redo our previous example. Our last PTX was for sm_50, because we didn't specify the architecture; but if we pass -Xsycl-target-backend with --cuda-gpu-arch=sm_80 — okay, we'll save-temps, and we'll see what the PTX is like. It should say sm_80, we'd imagine; let's just see. Okay, so this is a whole new set of files.
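Put together, the arch-specific command he's repeating looks roughly like this. The flag spelling follows the DPC++ CUDA-backend docs of the time; check the documentation of your installed version:

```shell
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
        -Xsycl-target-backend --cuda-gpu-arch=sm_80 \
        test.cpp -o test   # ahead-of-time device binary for the A100 (sm_80)
```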
B
So, again: whether we want our device binary to be produced ahead of time, or at runtime by the JIT compiler — that's up to you. I think the performance differences would be marginal, to say the least.
B
Okay, any other questions?
H
I have one stupid question: can SYCL be used with Fortran?
B
Not a stupid question at all. SYCL can't currently be used with Fortran. SYCL relies on C++ language features, so it sits completely on top of C++ — it's a part of C++. The only thing that makes it different is essentially the backend, how it interacts with whatever backend you might be targeting. So it is a purely C++ language; there isn't a Fortran API for SYCL. But, maybe—
G
That's something we've kind of talked about before. While SYCL doesn't have any kind of direct interoperability with Fortran, SYCL does provide what's called a host task, which is a feature where you can run arbitrary C++ code within the SYCL scheduling DAG; and from there you could potentially interoperate with Fortran code through standard C-ABI interoperability.
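A rough sketch of the host-task route just described, assuming a Fortran routine exposed through the standard C ABI — the routine name and signature here are invented for illustration:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical Fortran routine exposed via bind(c) as a C symbol.
extern "C" void fortran_saxpy(int n, float a, const float *x, float *y);

void call_fortran_from_dag(sycl::queue &q, int n, float a,
                           const float *x, float *y) {
  q.submit([&](sycl::handler &cgh) {
    // host_task runs arbitrary host-side C++ (here, a C-ABI call into
    // Fortran) as a node inside the SYCL scheduling DAG, so it is ordered
    // against the kernels and copies it depends on.
    cgh.host_task([=] { fortran_saxpy(n, a, x, y); });
  });
}
```

This requires a SYCL 2020 implementation with host-task support (DPC++ has it) and is only a sketch of the interoperation pattern, not workshop code.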
I
Yes, I have essentially two questions. First: if I have an existing, pure C++ application, could I just compile it with, let's say, the SYCL compiler and hope it will just work — even though it will only work on the CPU?
B
If you have CPU code, you're just using Clang — you're using the LLVM infrastructure — so you don't need to use SYCL, necessarily. You can just use it as a C++ compiler and it'll work; but it'll work purely on the CPU, in the exact same way that it would work with any other compiler.
I
That's fine. It's just that, when you want to port stuff, you want first to start with something that works without any changes, and move on from there. So I was just wondering if I can do that, and it looks like the answer is yes — which is great. Absolutely, absolutely. Maybe this was partially answered before, but if I wanted to interface with something like cuFFT, is that doable, reasonable?
B
Absolutely, yeah. SYCL — and DPC++ as well — offers a lot of interoperability APIs whereby you can essentially write native CUDA code. We're not going to be covering that today; it's a little bit out of scope. But it's something that's completely possible with SYCL, and with DPC++ as well: you can write completely native CUDA code in these kinds of interop tasks, which is very natural.
B
So this is a really easy way to port CUDA code into SYCL code, and then maybe slowly modify it towards more SYCL-leaning things.
C
Yeah, I can quickly point you to an example of how that's done — there are some examples that we could point you to.
G
To follow on a little bit from that: as well as DPC++, oneAPI also provides a series of libraries — things like oneDNN and oneMKL — and I think there's one coming for FFT. We're also working to try and support these with the CUDA backend as well, so the cuFFT support isn't available yet, but that's on the roadmap for the future.
E
Are there plans to support multiple GPU backends there? So, if you're wrapping — would it also work with oneAPI MKL? I guess you don't have an AMD backend.
G
Yeah, so these libraries all have backends for the Intel platforms, and we're working on supporting them for the CUDA backend. So far there's a good amount of support for oneMKL and oneDNN — we're actually in the process of adding additional support for oneDNN — and then we're keen to support some of the other libraries, like cuFFT and cuDPL; sorry, oneDPL.
G
That's planned for the future, but it's not available just yet, and obviously—
E
Okay. I've actually written a thin wrapper around all of these — it also supports AMD, so it covers the Intel, CUDA, and AMD ROCm platforms — for the subset of BLAS and FFT that we needed for our applications. So I guess if anyone else needs something now, you could take a look at what we've done; it's on GitHub under g-tensor.
B
So, currently — theoretically, a SYCL queue doesn't necessarily have to have a relationship to a CUDA stream, and in certain implementations, notably hipSYCL, a queue maps to a collection of streams, meaning that you can essentially have a queue executing concurrently, which is the goal. But currently it is actually mapping directly to a single stream. That is liable to change, though.
C
Sorry — I've posted the links in the Slack channel, so everyone should be able to get them there. But I'll let you talk through how this example works.
B
Yes. So you need to clone the SYCL Academy repo and make sure that you're on the Perlmutter workshop branch, and the code exercises are in there.
B
Let's see — Exercise 1. Okay, so essentially we just want to make sure that we're able to compile with the SYCL headers. We want to include the header and default-construct a queue — we don't need to think too hard about that at the moment — then see what device is associated with that queue, and get the info for the device's name (this is a string), and then maybe print that out or something, just so that we know what device we have chosen.
B
We have the solutions too. I would recommend not looking at the solutions before trying the exercise yourself — but it's a free world, so yeah. This is just a simple example of what you might write: default-construct a queue, get the device, get the device name, and then print the chosen device.
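The solution being scrolled through amounts to something like this — assuming the DPC++ headers; the print format is my own:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <string>

int main() {
  sycl::queue q;  // default-constructed: the runtime picks a device

  // query which device the queue was bound to, and print its name
  sycl::device dev = q.get_device();
  std::string name = dev.get_info<sycl::info::device::name>();
  std::cout << "Chosen device: " << name << "\n";
}
```

Compiled and run on Perlmutter with the workshop module, this prints whichever device the runtime (or SYCL_DEVICE_FILTER) selected.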
B
If people are having issues — obviously this is very simple code, so if you're having issues with this sort of code, we need to make sure that we figure them out. With a.out — okay, sorry — yeah: we need to make sure that we figure out any issues now, before we get on to more complex things.
B
Yeah, so there's the SYCL spec, and also the oneAPI spec as well. I find that the SYCL spec is very good.
B
So I think someone was potentially having an issue with a CUDA error: out of memory. I'm not sure if this has to do with not using a batch script — this is the kind of error that you might be susceptible to if many people are trying to—
B
—it is something that happens. So, do we have a reference sbatch script somewhere in the Slack that we could direct people to?
C
On the original deck that was shared — maybe in our full script. Let me dig it out; I think I can find something.
H
Okay, so it's not like you have to manually destroy anything — call something like cudaDestroy — when using SYCL? No?
B
No, no, no. This is C++ again — C++ has its own destructors. The C++ paradigm is a good rule of thumb: if something usually works in C++, the same goes for SYCL.
B
Cool, okay, yeah. We might crack on to the next exercise.
B
So, query your sycl-ls and you'll see what's available. The way that I think is nicest to specify the runtime device — sorry, the device filter — is using SYCL_DEVICE_FILTER, and we can say host. If you use essentially any of the words that appear in any of these triples, then it'll select the one that matches; so in this case: running on the SYCL host device — success.
B
Okay — and that was an invalid binary there, because I compiled for SPIR-V; I would recommend using SYCL_DEVICE_FILTER=cuda. There are other ways of constructing queues, where you statically choose what kind of device you want — and that's good as well, but it's outside the scope of today's workshop. There are lots of materials online where you can see how to do it; it's very, very simple.
F
Sorry to interrupt you — so here we need to specify ext_oneapi_cuda, yeah?
B
Yeah — you don't need to type the full name, essentially. I'm not sure exactly how the matching works, but any word that matches here in this triple will specify the right device. Okay, you could also do SYCL_DEVICE_FILTER=gpu.
F
That's good for people to know, because I think most users probably don't know that they need to specify some device filter before they run the binary. So, yeah — I think it's good for people to realize that we need to have a filter set in order to run the program.
B
The reason is — it's not actually essential; you can decide in the code what device to run on. But essentially this is, as I see it, a benefit of SYCL, because you can determine which device things run on at runtime: you don't need to change any code to run it on different devices.
B
I think this is a trick, if you like — something that will enhance the usability of your code — whereas you can also hard-code these things in your code. Okay, just for instance, I'll go to — sorry.
B
You can essentially tell the queue what device you want it to be constructed with; so here we're going to use a GPU.
B
Okay, so let's see — we'll do a quick compile, and maybe we'll see if there's something decent.
B
In fact, if you use the standard default constructor for a queue, then you have far more flexibility at runtime, because you can choose things: you can choose whether to run on this device or that device, whereas this will only allow you to run on a GPU. That might be desirable — it depends on what you're doing. If you're trying to prototype or debug, it's very useful, very dynamic, to be able to swap between the host and other devices very quickly. And there are lots of different selectors.
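Hard-coding the device at queue-construction time, as shown here, looks roughly like this. gpu_selector was the spelling used by DPC++ at the time; SYCL 2020 final renames it gpu_selector_v:

```cpp
#include <sycl/sycl.hpp>

int main() {
  // statically request a GPU: construction throws if no GPU is available
  sycl::queue gpu_q{sycl::gpu_selector{}};

  // the default constructor leaves the device choice to the runtime
  // (e.g. via SYCL_DEVICE_FILTER), which is handier for prototyping
  sycl::queue flexible_q;
}
```

The trade-off matches what's said in the session: the selector pins you to one device class, while the default constructor keeps the choice open until run time.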
F
Oh, that's a good idea. So you recommend that we just specify a generic queue — a queue without specifying the selector — and then we can use the device filter to run the program on a specific device?
B
As a user, this is how I use SYCL, and I find that this is really, really helpful — but it's up to the user, I think. Okay, anyway — you're welcome.