From YouTube: Intro to SYCL/DPC++ for GPUs
Description
Jeff Hammond from Intel presents a talk on Intro to SYCL/DPC++ for GPUs. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Oisín Creaner
A: My name is Jeff Hammond. I work for Intel in the HPC organization, and I'm going to try to give a short, high-level talk. I'm a big fan of Slack, and I'm on the NERSC Slack every day, so if you want more details, I'll put them there so that they're in writing and saved forever. That's maybe easier than doing it verbally, but hopefully there's time.
A: I'm going to talk about SYCL and DPC++, and I will explain the difference: SYCL is the Khronos standard and DPC++ is the Intel implementation, and I have details on this later. I'm focused on GPUs, although neither of these things is specific to GPUs, and I'll show more on that later.
A: I used to be at a DOE lab, I've been around for a little while, and one thing I find particularly interesting is exascale system architecture.
A: Ten years ago, give or take, there were two swim lanes. One of them was sort of Blue Gene-like and the other was basically NVIDIA-like, and that's sort of how the world was oriented. We've since done quite a bit of a rotation in some respects.
A: There's not really a many-core CPU anymore, although that's debatable, because the CPUs in both of these systems will have dozens of cores, but it's not many-core in a Blue Gene or Xeon Phi sense. And of course, neither the exascale system at Oak Ridge nor the one at Argonne is slated to have NVIDIA GPUs, although Perlmutter will have them, and other sites will as well.
A: This is actually kind of a pleasant surprise for those of us who have been advocating for standards and portable programming, because if you were planning to run your exascale application on a many-core CPU or an NVIDIA GPU using something vendor-specific, you are probably sad now. But if you were focused on something that ran on a lot of different machines, like Kokkos or OpenMP, then you're probably fairly happy right now. So I'm going to talk about another option besides Kokkos and OpenMP for portability on some of these systems.
A: So SYCL, like I said, is a Khronos standard, and there is already an ecosystem for it. There are three different implementations that are relevant to GPUs. The first one I'll mention is Intel's Data Parallel C++ compiler.
A: It is based on Clang/LLVM. There's an open-source version on GitHub, and the product we ship in oneAPI is derived from that open source quite directly; the differences are basically different Git hashes compiled on different days of the week. We have some GPU extensions, although as of literally today, nearly all of our GPU extensions are part of the SYCL 2020 provisional standard that was announced today.
A: I'll talk a little bit more about that later. So the DPC++ compiler supports Intel GPUs, CPUs and FPGAs; obviously, we implemented that. Codeplay, which is a company in Edinburgh, contributed support for NVIDIA, and that's available in the open-source version. You have to build it yourself for CUDA licensing reasons; there's nothing we can do about that. There's also Codeplay's ComputeCpp product, which is a different implementation from Intel's. They're both based on Clang/LLVM, but they are different.
A: And that's great because, as folks know, if you've ever debugged a Clang bug or a vendor compiler bug, it's wonderful to compare against GCC and know that there are two different code paths: if they both give the same answer, then maybe your code is wrong, and if they give different answers, maybe one of the compilers is wrong. So Codeplay's compiler supports OpenCL SPIR-V devices, of which there are a few, and I'll show that on a later slide. Their compiler supports our GPUs among others, and they also have a PTX backend for NVIDIA.
A: So both of these first two support NVIDIA and Intel and any other device that supports SPIR-V and OpenCL, which is unfortunately not that many. Heidelberg University produces something called hipSYCL. This is by a fantastic young fellow named Aksel Alpay. It's based on Clang/LLVM, specifically CUDA Clang, and it's related to, or uses the same code as, HIP, which is in the name.
A: Obviously. So hipSYCL has a CPU backend that uses OpenMP, an NVIDIA backend using CUDA Clang, and an AMD backend using the HIP/ROCm stack. So you see here all the GPUs of interest, and actually Arm GPUs are supported too, not that that's particularly relevant unless you're going to do exascale on your cell phone. But there's a nice, healthy ecosystem of support for a lot of GPUs.
A: There's also a compiler called triSYCL, developed by somebody at Xilinx Research, that targets FPGAs rather than GPUs; you can look up the details. I use it on my laptop all the time.
A: I want to talk about performance portability first, and I'm going to cite two different results from IWOCL by researchers not at Intel. It makes it really easy to cite third parties when we compare vendors; it makes my life easier. You can see in the link here there are YouTube videos, the PDF is online (it was free last time I checked), and the code is online, so you have full freedom to explore, reproduce, etc. You can see here on the right the BabelStream triad excerpt from the paper.
A: They had some other numbers, but stream triad is so well known that I figured I'd cite that one. Starting from the right: on the AMD GPU (which generation, etc., is in the paper), you see a negligible difference between OpenCL, SYCL and HIP. On NVIDIA you see a small difference, about five percent for SYCL relative to OpenCL and CUDA. I don't know all the details.
A: There are occasionally runtime and code-generation differences, and sometimes you can soften those differences with minor tweaks if you know how the tools work. On the Intel device, you can see SYCL and OpenCL are essentially identical, and that's because they're basically the same implementation behind the scenes; SYCL is just a prettier front end, as I'll show later. Notice the scale here: it doesn't start at zero, it starts at 60, so you see there's a 25 percent difference on Xeon.
A: This is a compiler bug, basically. We know about it; it'll be fixed at some point. It has to do with the OpenCL compiler doing codegen differently from the OpenMP compiler, not something intrinsic to the programming model, and I suspect that one of the other SYCL compilers for Xeon would not have this issue. So if you care about memory bandwidth, it's nice to know that you can get very, very close to performance portability in bandwidth with a bunch of different programming models.
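
[Editor's note: as a flavor of what such a benchmark kernel looks like, here is a minimal sketch of a BabelStream-style triad in SYCL 2020. This is an illustration only, not the actual BabelStream source; the array length and scalar are made up.]

```cpp
#include <sycl/sycl.hpp>

int main() {
  const size_t n = 1 << 25;   // array length, stream-benchmark style
  const double scalar = 0.4;
  sycl::queue q;              // default device selection

  // USM shared allocations are dereferenceable on host and device
  double* a = sycl::malloc_shared<double>(n, q);
  double* b = sycl::malloc_shared<double>(n, q);
  double* c = sycl::malloc_shared<double>(n, q);
  for (size_t i = 0; i < n; ++i) { b[i] = 1.0; c[i] = 2.0; }

  // Triad: a[i] = b[i] + scalar * c[i]
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    a[i] = b[i] + scalar * c[i];
  }).wait();

  sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}
```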
A: This is another paper from IWOCL, about a month or two ago. This is from Argonne, by Brian Homerding and John Tramm. They have results with RAJAPerf, and I think XSBench was in there as well. Again, it's all online; you can grab all of it, and I won't try to go through all the details because they're in the paper. But RAJAPerf is sort of a suite of kernels that are relevant to the NNSA multiphysics workload.
A: You know, hydrodynamics; it also includes some simple stuff, some complicated stuff, a variety of different kernels. One thing I think is interesting about this (I don't even remember which of red and blue is positive and negative for SYCL relative to CUDA) is that, for reasons one would have to diagnose by reading assembly, you see winners and losers in both directions. So it's not monotonically "CUDA always wins and you always pay a price with SYCL."
A: Obviously, I'm sure somebody at NVIDIA could make CUDA always beat SYCL if they played around with tuning the kernels; there are some differences in how the stores are generated in some of these kernels, I think. But this is another example where you can get modest performance portability with interesting code using SYCL on a GPU.
A: So, talking about the different languages: what is Data Parallel C++? This is the Intel compiler implementation, including some extensions. It's based on SYCL 2020, which, like I said, was just released today. SYCL 1.2.1 is the thing that's been around for a while; that's what's broadly implemented today. SYCL 2020 is supported approximately in full by Intel, although I don't know the exact details, and Codeplay is getting there; I don't know their full feature compliance.
A: We tried to standardize all of our extensions, and we were successful with all the GPU extensions. The extension that didn't make it is related to FPGAs, and I don't know whether it was intended to go upstream and was rejected or not. But you can expect standards-compliant SYCL code to always be sufficient on Intel GPUs, and we'll add extensions when users need them.
A: For example, one of the extensions we submitted to SYCL 2020 is called USM, unified shared memory (the same name as in OpenMP 5), which I'll show later. It basically gives you malloc and pointers. We did that because our friends at DOE working on Kokkos and so on said:
A: "We need pointers; we need CUDA-style memory management in order to be compatible with our design." That made sense, and it made sense to the SYCL community, so now that's standardized.
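
[Editor's note: to make the CUDA-style USM pattern concrete, here is a minimal sketch against the SYCL 2020 API. This is an illustration, not code from the talk; the buffer size is made up.]

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q;
  const size_t n = 1 << 20;
  std::vector<float> host(n, 1.0f);

  // Device allocation: a raw pointer, only dereferenceable on the
  // device, analogous to cudaMalloc
  float* dev = sycl::malloc_device<float>(n, q);

  // Explicit copies, analogous to cudaMemcpy
  q.memcpy(dev, host.data(), n * sizeof(float)).wait();

  // Kernels capture the pointer directly; no buffers or accessors needed
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    dev[i] *= 2.0f;
  }).wait();

  q.memcpy(host.data(), dev, n * sizeof(float)).wait();
  sycl::free(dev, q);
}
```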
A: In our case, for all of our extensions, both the documentation and the implementation are available on GitHub. There's nothing proprietary about them; anybody can re-implement them, and anybody can port the compiler to any other device. So it's open in all the different degrees of openness. Obviously, we want to have everything in Khronos, but we'll also be open in the sense of "hey, it's on GitHub."
A: You can see what we're doing. So why SYCL? OpenCL has been sort of the portable open standard for GPUs and other devices for a long time, and there are some good things about OpenCL: it's portable, and that makes it better than a lot of other things out there. But it's got some warts. People often complain that it's too verbose; it's maybe the difference between MPI and UPC or Coarray Fortran.
A: People don't seem to mind MPI, but they minded OpenCL, for whatever reason. The big reason that OpenCL was not the right place to go, though, is that OpenCL does not have holistic C++ support, and modern C++ is really an essential thing for modern programming. We know NVIDIA is very strong on C++ stuff, and we're big fans, and there's a parallel STL.
A: So there's all this C++ work going on out there, and it really was important to make sure that C++ was a first-class citizen in this model. So SYCL is based on modern C++: the first spec was based on C++11, and it's now based on C++17, so it includes all the good stuff.
A: You know, CTAD and all those other fancy acronyms that people like. If you like TBB or you like the C++ STL, then SYCL has a lot of the same concepts, and it becomes a natural thing to port over.
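
[Editor's note: as a small illustration of that modern-C++ flavor, here is a sketch of the classic buffer-and-accessor style using C++17 class template argument deduction (CTAD), so the buffer and accessor types are deduced rather than spelled out. This is an editor's example, not from the talk.]

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  std::vector<int> v(256, 1);
  sycl::queue q;
  {
    // CTAD: deduces sycl::buffer<int, 1> from the container
    sycl::buffer buf(v);
    q.submit([&](sycl::handler& h) {
      // CTAD: deduces the accessor type and access mode from the tag
      sycl::accessor a(buf, h, sycl::read_write);
      h.parallel_for(sycl::range<1>(v.size()), [=](sycl::id<1> i) {
        a[i] += 1;
      });
    });
  } // leaving scope destroys the buffer, which synchronizes and
    // writes the data back into v
}
```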
A: I will also say that the closest thing to SYCL I've ever found is Kokkos, so if you're comfortable with Kokkos, I think you'll find SYCL quite comfortable as well. One of the ways I discovered SYCL, before Intel was doing DPC++, was actually as part of a sort of cross-industry analysis of modern C++ models, in which Kokkos, RAJA, TBB, PSTL and SYCL all showed up as interesting things that people might want to use. And SYCL is really the first standardized programming model to take on heterogeneous programming with modern C++. Kokkos is certainly open, and I'm a big fan of it, but Kokkos has a different notion of openness than Khronos does, and there are pros and cons of each. We wanted to make sure we were building off an industry standard that would be widely implementable when we were doing our GPU software program.
A: So this is the ecosystem. This is pulled off the Khronos website, showing all the different implementations and all the different devices, and you can see that, depending on drivers and whatnot, pretty much everything is here: all the CPUs, because of LLVM, and all the GPUs that I know about, including Arm and PowerVR (I think that's a GPU).
A: I haven't personally verified anything with Xilinx, but I have run our SYCL compiler on our FPGAs, and I know that works. So that's pretty much everything, and this is cool to me: if you look at all the standards out there, any programming language or programming model for accelerators, SYCL actually has the most device support of anything that exists. You know, OpenMP and OpenACC do not support FPGAs.
A: There are, of course, other software models that are supported on a subset of these things. If you know of anything that supports more hardware than SYCL, please do let me know; I'd be curious to hear about it. So I'm going to go real quick through some syntax. What's my time like? My timer didn't start at zero, so I don't know how far in I am.
A: Okay, I'll go fast. Thank you! I'm not going to belabor this; I'm just going to show OpenCL (verbose, tedious, bad), then SYCL, which has the same level of expressiveness but is still a little bit verbose. Then this is SYCL 2020, so you get USM and eliminate a lot of syntactic bloat.
A: This is pretty nice, but there's one thing here you might want to flatten out, and that's this, which is currently not standardized, but it's literally just syntactic sugar in a header file. The thing is, if you look at this and compare it to something else out there like Kokkos, the syntactic expressiveness is pretty much there. One thing that's nice is that it's fully asynchronous; that's not always available in other models.
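
[Editor's note: as an illustration of that asynchrony, in SYCL each queue submission returns immediately with an event, and dependencies are expressed by chaining events rather than by blocking the host. A minimal sketch, assuming the SYCL 2020 queue shortcuts:]

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;  // out-of-order by default: independent work may overlap
  const size_t n = 1024;
  float* x = sycl::malloc_device<float>(n, q);

  // Both calls return immediately; e1 is passed as a dependency of the
  // second kernel, so the runtime orders them without blocking the host
  sycl::event e1 = q.fill(x, 1.0f, n);
  sycl::event e2 = q.parallel_for(sycl::range<1>(n), e1,
                                  [=](sycl::id<1> i) { x[i] += 1.0f; });

  e2.wait();  // the host only blocks when we explicitly ask it to
  sycl::free(x, q);
}
```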
A: You can see the Kokkos Slack for details on that. And I may have used too many characters here, because I'm not the best C++ programmer. I can give example code anytime; a lot of this stuff is on GitHub somewhere. So yep, that's it. Thank you.
B: Thank you, Jeff. We had a comment there in the chat about the lack of Fortran support, and I was wondering if you could comment on that.
A: So the first thing is: I love Fortran. I've written a lot of Fortran in my life, in NWChem. At Intel, we believe that OpenMP is a fantastic solution for Fortran programmers, as well as for C99 purists and C++03 programmers, and we're supporting OpenMP on our GPUs; I just didn't talk about that here because I had 15 minutes. So option one is: use OpenMP.
A: Fortran doesn't let you glue things onto the language the way C++ does, but that's why we have OpenMP. I have actually talked elsewhere about the merits, one way or the other, of OpenMP and DPC++ for Fortran applications; which one is the priority for people will depend on what their comfort level is and what their requirements are. But I think the standard answer would be: if you have a Fortran application, please keep using Fortran, use OpenMP for GPUs effectively, and you'll be happy.
B: Okay, and one other question before we go to the break: there was a comment that there is OpenACC support for FPGAs. Again, I'm not sure whether that's necessarily a question you have an answer for, but.