From YouTube: Data Parallel C++ (DPC++) Programming Model
Description
Abhishek Bagusetty (Argonne)
Data Parallel C++ (DPC++) Programming Model
I'm from the Performance Engineering Group at the Argonne Leadership Computing Facility, and I'd like to thank the organizers for giving me this opportunity to talk about DPC++ in a different setting, for Perlmutter. First I would like to acknowledge Jeff Larkin and Johannes, who have laid the groundwork for much of my talk. So without further ado, I'll talk about the Data Parallel C++ programming model, specifically for NVIDIA GPUs, and the more practical case here of Perlmutter.
I'll start on a different note here, which is SYCL. Many of you have heard this keyword: it's a specification, a language specification from the Khronos Group. The Khronos Group handles the specifications for OpenCL, OpenGL, and many other models, and SYCL is one of those. I would like to emphasize that SYCL is not a programming model but a language specification, and I'll get to how SYCL and DPC++ are related to each other in a bit. Long story short, SYCL is, as I said, a language specification, and it carries features with naming heuristics similar to OpenCL. So anyone familiar with OpenCL might see exactly the same keywords appear in SYCL.
The salient feature of SYCL is that it's a C++ single-source programming model, which means that the host and the device code coexist. And when I say device code, I'm not only referring to GPUs; it could be anything: GPUs, FPGAs, DSPs, etc. So, long story short, the single C++ source code runs on the host and the devices. And then there is the memory model, which refers to USM and the buffer model.
I'll give brief examples later in this session about the differences between these memory models. The biggest difference between the two is that one offers a lot more control over the memory transfers, and the other just does that job for you. That's the stark difference. And like any other GPGPU programming model — take CUDA, HIP, OpenCL — I mean, this is the fundamental feature that one would ask for.
It carries an asynchronous programming model, where you just overlap compute, copy, and host operations to maximize performance and reduce the time to solution. Coming to the most important point: it offers portability. As I said, it is a single C++ source code which can run on host and GPU devices — and it can run on any hardware. It was initially designed for Intel GPUs, which was clearly the motivation, but it can run on NVIDIA and AMD GPUs as well.
And let's not forget about productivity: how many lines of code do you write, and what is the biggest performance gap that you see against the native programming model, for example CUDA? I'll briefly touch on those points.
Moving on — coming to the title of the talk, Data Parallel C++: this is nothing but Intel's oneAPI implementation of SYCL. As I said, SYCL is a language specification, and vendors have the freedom to take that specification, implement it, and provide tools for us. So DPC++ is a programming model which is an implementation of SYCL; another way to view it is as just a mixture of C++ and the SYCL standard with some extensions.
These are the three components that form DPC++. As you've heard in the morning, C++ has been evolving into a modern parallel programming language since the C++17 ISO standard, and one of the fundamental requirements of SYCL is C++17 compliance. As I was saying earlier, it is built on modern C++, and the most important feature is that it is a cross-architecture standard.
Moving forward: I just talked about C++ and SYCL, but there is another piece, called extensions. These are the extensions that the vendors implement for anything related to productivity, ease of use, or performance. And some of these extensions were, fortunately, adopted by the SYCL standard recently.
So the goal is that all these implementers of the SYCL standard develop their extensions, and if they prove to be useful to the open community, the LLVM community adopts them — the goal is to open-source them to LLVM upstream. And, more importantly, these extensions are closely observed by the SYCL and Khronos working groups.
So it's quite a collaborative effort and a good feedback loop. As I was just saying, SYCL is a portable programming model, with a specification based on the C++17 standard, backed up heavily by the industry.
It's open source, and it's a single-source programming model. There are several libraries built on top of SYCL, based on C++. This is a flowchart of the SYCL compilation; I'll keep it simple: it just has two compilation workflows, one for the device and one for the CPU. If it goes through the device path, the SYCL compiler chooses whatever backends are available — OpenCL targeting all these devices, or other backends.
You could choose CUDA, HIP — anything that targets those devices — and there is another path, traditionally, for the CPUs as well. So this is just a simple compilation workflow if you take a SYCL programming model. I'll just talk about the compiler vendors that are active in this space, and mostly focus on the leftmost one, Intel's DPC++ implementation. There are other active contributors, such as Codeplay and hipSYCL, on the DPC++ side.
You just have several devices, each going through its own plugin: Intel GPUs go through the Level Zero plugin, NVIDIA through NVPTX, and AMD through GCN. So DPC++ just provides a portability layer for SYCL and then targets all these devices.
So what's the story with SYCL at NERSC? There has been a collaboration between ALCF, NERSC, and Codeplay to enable SYCL on NVIDIA A100 GPUs.
The initial scope of the work is largely completed, but there is also a good bit of tracking for the libraries as well, because any scientific endeavor requires good support from the compilers and the libraries to have a seamless portability story.
So you could check out this module here, and SYCL at NERSC: there was a training event that happened in March, and there is pretty good training material that one could look at — it's self-evolving — so feel free to check it out. I'll just talk about some of the heuristics of SYCL; for people who are familiar with CUDA or HIP, it has very similar equivalents.
I'll talk about SYCL queues and contexts as abstractions. SYCL queues are just a mechanism to provide work to a device — think of it that way — and SYCL contexts are nothing but like a CUDA context, which everyone overlooks; the same case applies with SYCL. A SYCL queue is like a handle that you submit a job to, and it then dispatches the work to the host or device. To be brief, a SYCL context is nothing but like a CUDA context.
Loosely speaking, these contexts provide a mechanism for resource isolation or sharing — whether you want to share the memory with the next GPU or not — and queues are nothing but like CUDA streams, which most of you are hopefully familiar with; they provide an asynchronous mechanism with the host.
CUDA streams are only in-order, while SYCL queues are both in-order and out-of-order, so you could choose either. And if you choose an in-order SYCL queue, it exactly mimics the behavior of a CUDA stream, which is first in, first out. So SYCL has very one-to-one mappings with CUDA.
As Johannes was talking about some of the active development that goes into the compilers: yes, you can build your own compiler with ease. These are the instructions to build on Perlmutter, and there is already a module that one could use as well. If you just clone and build with the CMake instructions, you will find the compiler — it's as simple as that. But it just takes a while to build, so just be careful on that.
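As a hedged sketch of what those build instructions typically look like (following the upstream intel/llvm getting-started flow; exact branches, flags, and module names on Perlmutter may differ):

```shell
# Clone the open-source DPC++ compiler and configure it with the
# CUDA (NVPTX) backend enabled, then build. This is a long build.
git clone -b sycl https://github.com/intel/llvm.git
cd llvm
python ./buildbot/configure.py --cuda
python ./buildbot/compile.py
```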
As I said, if you see these keywords, they have very familiar naming with respect to OpenCL as well, and people who are familiar with CUDA will find it very easy to adapt to SYCL too. And what is the motivation behind adapting to SYCL? It's a portable programming model, right? I'll just briefly touch on the subgroups here.
Sub-groups are nothing but warps in CUDA. For people who are very familiar with CUDA warps and what they offer: sub-groups offer exactly the same features. So SYCL sub-groups map to CUDA warps on the NVIDIA side, and to wavefronts on the AMD side.
The memory model is again very much similar to CUDA as well: registers, shared memory, global memory — if you're familiar with all these terms, it's exactly the same, so the learning curve is quite small when you venture into SYCL. I'll just show a simple snapshot of how you allocate memory in CUDA versus how you allocate memory in SYCL. The most important thing here is: all you need to change is the selector to the GPU, which then runs the entire parallel code on the GPU.
So if you replace the GPU selector with a CPU selector, it would run on a CPU. That's the way it targets the different hardware, and you could do it in a compile-time way or in a runtime fashion as well.
This is a very textbook example of SYCL that I just want to show you — what the workflow looks like — and I want to briefly touch on the buffer memory model that I introduced at the very start. So you have the standard headers, and so on.
Moving to the USM memory model, which offers a slightly more familiar feel, because buffers are quite complex and involve too much code. If you look at the USM model, you see all these pointers, which you are familiar with: the USM model is nothing but a pointer-based model. So you have the data structures: allocate memory, memcpy, launch the kernel, copy the results back, and print the results. This workflow is very similar to CUDA and any other GPGPU programming model. So that is the memory model.
I'll just skip this for the sake of time and talk about performance benchmarks. These are the benchmarks that were carried out in collaboration with Codeplay on Perlmutter.
The question is: how does CUDA compare with SYCL? As you can see, this is the BabelStream benchmark, which is heavily used these days on different hardware and different programming models; in most of the cases, if you compare blue and orange, it's almost very comparable. And LULESH is one of the mini-apps that you just heard about from Jeff Larkin; you could see the difference in performance between CUDA and SYCL there. This is a quite old benchmark, so it would be starkly improved — and similarly with RSBench and the others.
These are some of the other benchmark assessments; as you can see, SYCL sometimes beats CUDA performance, and otherwise shows very similar performance. So these are some of the benchmarks that we carried out — and I should admit that these are quite old and need to be rerun; the story would change quite a lot, because there have been quite a bit of improvements coming on the practical side.
I don't know what's happening here — okay. The question is: how do I port an existing CUDA code to SYCL? There is an open-source tool which ports a CUDA project — I'm not talking about a single file or a kernel, but the entire project — to SYCL, which could then be deployed on different hardware. This is open source; feel free to check this out.
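The transcript does not name the tool; it is most likely SYCLomatic (open-sourced from the Intel DPC++ Compatibility Tool, `dpct`). As a hypothetical invocation, with illustrative paths:

```shell
# Migrate a whole CUDA project tree to SYCL (paths are placeholders).
dpct --in-root=./cuda_src --out-root=./sycl_src ./cuda_src/*.cu
```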
There is an additional resource if you want to see what the equivalents of CUDA constructs look like in SYCL. The obvious practical question is: if I have math libraries, what's the story behind that? oneMKL comes as part of DPC++ — oneMKL is nothing but the oneAPI Math Kernel Library. This works on multiple backends, so for NVIDIA you could use the oneMKL APIs, but it just piggybacks on the standard CUDA libraries, and the same goes for AMD. So it's the same performance; there is no performance difference or performance hit that you observe.
So, okay: these benchmarks were performed with the USM memory model, because the buffer model is quite outdated by now, if I should say so. And yes, I do agree that it's very bad to not show plots that start at zero, and I apologize for that, for this LULESH performance plot.
Yeah, yes, I agree — the discussion is that it would have been a much better showpiece of performance if it started at zero, because much of these are quite comparable for LULESH. Any other questions? Sorry about the background noise.