From YouTube: Intro to GPU: 05 Programming for GPUs with Directives
Okay, so I'm going to talk about directive-based programming. You've heard Jack, and you've heard Max, talk about the other approaches; this is in the middle: it's not as easy as using a library, but it's also not as hard as using lower-level programming.
So first I want to thank lots of people whose materials I used. In particular, I've got permission to use the NVIDIA OpenMP training materials, so there are some slides and, especially, the hands-on code from them: the Laplace equation.
It was a Jacobi solver; I've mostly followed their OpenACC code and converted it to OpenMP, just so we can see some details. There is also Tim Mattson and Simon McIntosh-Smith's "Programming Your GPU with OpenMP" tutorial at SC19, and (this is probably not the complete list) other talks such as "What's New in OpenMP 5" from my colleague. And then Chris Daley here: his slides have lots of performance data, as does the NVIDIA bootcamp.
There is also, most recently, the ECP OpenMP BOF as well. What I'm going to do today is not a separate OpenACC talk followed by an OpenMP talk; basically, I will try to mix them together, because they're so similar: lots of concepts are equivalent. That's what I'm trying to do here today. So first, CPUs versus GPUs.
Okay! So here are one sample OpenMP code and one simple OpenACC code. These are directives (I'm going to explain what a directive is later), but they are basically three lines added to each of the original source codes, and the compiler may ignore them if it doesn't recognize them, or if OpenMP or OpenACC support is not enabled. So these are called directives: a #pragma for C/C++, and a !$ sentinel comment for Fortran. The advantages of directive-based programming: first, you can do incremental programming.
You find your hotspot, you add some directives, you check progress and check correctness, and then you repeat. It also allows you to maintain a single source for sequential and parallel programming: you use a compiler flag to enable or disable it, and there's no major rewrite of your sequential code. It works for both CPU and GPU, and it has a very low learning curve, because you stay in your familiar programming environment of C, C++, or Fortran. You do not have to worry about lower-level hardware details; the compilers will hide them for you. And it's portable.
So let's talk about what we call the device execution model. Basically, it's host-centric: the host is here, and the device does the work. When you do the offload, you create an environment on the device, and then it starts to map data out there and offload work there; when it's done, it gets the data back to the host, and then it can destroy the environment on the device. The CPU is usually the host, and then you can have one or multiple devices.
Parallelism-wise, just remember: there's no synchronization among the teams; within each team, you can have synchronization. So what we do for the offload is the pragma omp target, and then teams distribute parallel for simd. We recommend writing the whole combined construct instead of separate directives; the compilers will work it out. The reason is that, with the three levels, each compiler chooses different levels to parallelize, so this way you cover everything. Here's a little diagram of what it means.
So here we see target (there's a data construct outside; we'll talk about data later): basically omp target, and then teams. You could set the number of teams here, but without distribute, everybody would do all the same work. You have a league of teams, and at this point every team runs the same thing; once you add distribute, the loop is divided among the multiple teams. And then, when you do parallel for simd, within each team you create more threads and more vectorization. This is very, very similar to the OpenACC style.
What OpenACC has is: instead of target, you have parallel, and again you offload to your device; then here you add gang. But if you stop at this point, with just gang here, you will have created more workers and gangs, but they will do much redundant work as well.
So what you do is pragma acc parallel, and then you add acc loop, or acc loop gang worker vector; or, if you have multiple loops, you put acc loop gang on the outer loop and then acc loop vector on the inner loop. For more hands-on tuning you would do that, but at the beginner level, just write acc loop and let the compiler choose it for you.
So the loop directive: it's available in OpenACC, and it's already in the OpenMP 5.0 standard, so you will see implementations coming out soon, and it'll make your life a little bit easier as well. It gives you very similar ways to do it in the OpenACC and the OpenMP environments.
Okay, now let's talk about the syntax of OpenACC and OpenMP. We talked about the pragma: #pragma acc for C/C++, and for Fortran you would have !$acc with an end directive; depending on what the directive is, not every end is required, and some of them are optional. For C there's no end part: the scope is set by curly braces, or, if it's a one-line statement, you don't even need those. So this is how you would form a directive with OpenACC or a directive with OpenMP.
This slide I borrowed from the OpenACC training materials, to show how many gangs are created with parallel, just to give you a visual picture. Without a loop directive, everybody does the whole loop. Once you add #pragma acc loop, the loop is distributed, and then, without further clauses, it's at the gang level, with worker and vector underneath.
For OpenMP, there are three levels: teams, parallel for, and simd. But look at the list up there: a bunch of compilers basically use only two levels, teams and parallel, and ignore simd. The Cray compilers used to ignore parallel for, for CCE 8, but with CCE 9 the C compiler is now Clang/LLVM based, so it follows the LLVM approach.
Now it uses teams and parallel, but CCE 9 Fortran is still the classic Cray Fortran front end and ignores simd. And then Intel, and LLVM Clang 11, which is under development, will try to do all three levels. So that's what we say: write omp target teams distribute parallel for simd as the whole thing. There are some caveats: some of your algorithms might not fit so well, so you have to separate them, or sometimes you want to collapse; there are things in between.
I just want to list some of the hardware and software mappings. If you're familiar with, say, CUDA, OpenCL, or the hardware, you know what these are. At least for OpenACC and OpenMP, you can say gang is teams, worker is thread (or we could say parallel for in OpenMP), and vector is simd. And in CUDA, you will have heard of thread blocks, threads, and warps. So, as we said, we recommend you just use acc loop, or the combined OpenMP construct, and let the compiler do it.
So now I'm going to walk through a Laplace equation example that the NVIDIA people provided with OpenACC. I just want to show you, with OpenACC and with OpenMP, and with all the different compilers, how we can solve this problem. It's not an extensive optimization: there are more things you can do (you can tile, or collapse loops; all of that is not being applied), just this level, with a data region being considered. But let's start from the beginning. All these codes are also in the hands-on session.
So this is the C code. Let me first say its basic physical meaning: basically, you have a grid, and you iterate; each point in the middle of the grid, at the next time step, is the average of its four neighbors. Then you have the new whole grid, and you do your error checking to compare how much this one has converged relative to the last time step, until either you reach your maximum steps or you reach the convergence criteria.
Then the problem is solved. So the source code is: while your error is bigger than the tolerance you have to continue, or while your iteration count is still less than your maximum you continue. Inside, you do the calculation, then the error checking, and then you do a swap: this time step becomes the old one, and then you calculate the new one again.
Okay, before we go on, I want to introduce two common clauses we often use in OpenMP; they also apply to OpenACC. Then we can directly apply them to this example without worrying about more complexities. One is reduction. One example, on the right side, is: if you compute an average in a loop, you basically add the values together, basically doing a summation.
But this is not parallelizable without reduction, because there is a loop-carried dependency. So we have this clause, reduction, and basically each thread holds its own local sum, and then the implementation adds them together. So we want to use reduction here; I'll show why we need it in this application. And then collapse: if you have multiple loops, especially if your outer loop is too small, then with collapse:
You can have a much bigger iteration space to work with, which can be distributed among the threads. So we did that. Now we go back to this application. You can see that the reduction is needed here, because of this max on the error: every time you get a new error, you do a max with the previous value, just like a reduction with the max operator. So we do that reduction; and then, as in the original example, you have two levels of acc loop.
So this is the one we call the parallel implementation, without anything else like data clauses yet. And let me introduce one concept: CUDA has managed memory. This allows the compiler to help you manage data as if the host and the device shared the same memory space, so that you don't have to transfer it yourself; the compiler does it, and it can save you lots of data transfers back and forth. That's the story for OpenACC.
You treat your data as if it were in the same physical space. For OpenMP 5: first, you need your hardware to support this (you can check with nvidia-smi or similar that it's supported); then the implementation has to actually support this feature, called requires unified_shared_memory, and you just add that. I don't think it's widely available yet, but this is going to make things much easier next time, when it's on the market. So now, let's try with this managed memory.
So the compiler was PGI's pgcc, and we wanted some optimization, at the -fast level, with -ta=tesla:cc70 plus the managed option, and -Minfo gives you lots of output to tell you how it is going to generate the accelerator code: whether it is parallelizing, where it is using vectorization, what block size and thread size it is using. You get all of these. However, with this parallel implementation, compilation immediately fails with the C compiler.
The reason is that in C the arrays are dynamically allocated, and the compiler doesn't know the size. If I use the Fortran implementation instead, it compiles. So now we do it with managed: the size doesn't matter anymore, it doesn't need to know it, and it runs, and it's not bad; the memory copy time is almost zero for this. If you use managed, there's no data that needs to be copied manually.
So here are a few of the data clauses we now introduce, because without managed memory we need to add data clauses to the parallel construct. So what are they? Putting OpenACC and OpenMP together: OpenACC calls it copy, and OpenMP calls it map, with to and from. The next slide is a table that shows how they are equivalent. In C/C++ an array section is starting index and length, while in Fortran it is starting and ending indexes.
Okay, so this is the table I showed. The copy clause is the equivalent of map(tofrom) in OpenMP; tofrom is actually the default, but you can do just to or just from. To means from host to device; from means from device to host. Then copyin and copyout map to map(to) and map(from), and create in OpenACC is alloc in OpenMP, so you can allocate on the GPU, but without a copy: it's treated as temporary data on the GPU. And there's present:
You can check if the data is already there, and then you can save some time. Okay, so now we add the data clauses, because the data is allocated on the host and you're using it on the GPU (since you wrote #pragma acc parallel), so you need it to be on the device. You use copyin for an array where you just give the device the initial data, and when it's done you want to get the result back to the host.
So you want copy (in and out) for one array, and copyin for the other, because we have our initialization data there. Like I said, we added the reduction and the collapse. So now we have the data clauses. But remember, we also have a bigger outer loop of iterations, the big while loop, so this moves the data every time through that loop.
And yes, it's very, very bad: about 200 seconds, where before, with the managed data, it was about one second. The reason: when you compile, it says it is going to generate this copy if the data is not already present, and at runtime it actually does. This is the output when you run it with the accelerator timing turned on: you get the data transfer times. There's also nvprof you can run, which is the NVIDIA-provided profiler that you're going to hear more about later.
Basically, with that you get some output that shows how much time was spent and how many times each operation is called: over 33,000 calls to do this CUDA memcpy. HtoD means host to device; DtoH means device to host. Basically, for each iteration it started a new parallel region, offloaded everything, copied, and repeated.
So let's, like I mentioned earlier, keep the data on the device as long as possible. Let's not move it every time in the region; let's do it outside of this while loop. Your data is reused every time step, again and again; there's no need to move it in and out. So there's the acc data directive for that, and you can do a target data region with map in OpenMP as well.
Okay, so this is the OpenMP one, and I use Fortran, because for Fortran I can actually use the Cray compiler, which is faster. Otherwise, if I showed you the Clang numbers, you would see three seconds, but with the Cray compiler it's really fast. So here is the Fortran one, with OpenMP, and also with the data region outside of the while loop.
A
So
we
can
do
better
or
not
exactly,
but
it
depends
on
what
your
application
are.
It's
if
the
the
single,
the
structured
data
directly
requires
you
that
enter
data
and
int
and
and
data
region
enter
for
target
enter
to
exit.
You
need
between
the
same
function.
Call
sometimes
you
know
big
application.
It's
really
hard
to
do
so.
You
unstructured
data
region
so
that
you
can
have
enter
and
exit
to
multiple
times
different
places,
for
example
for
this.
I mentioned the differences: with structured, you have to express the start and end within a single function; with unstructured, you can have multiple start and end points and can branch across multiple functions. Okay, now back to this example code: you now have an enter data copyin in your initialization function, and then an exit data in your deallocate function.
Otherwise, this is the OpenACC one with PGI: there's a regular version and a managed version. The regular version you could probably optimize more if you wanted to, but up to now this one is not fully optimized: you can do more with each of them, with async, nowait, and all these other things, but that's not in there yet; I didn't do that. It's just a data point up to this level. Any questions?
Okay, going on: I want to mention something that doesn't exist in OpenMP. This is acc kernels. The kernels construct is: you can just say, hey, I want this hotspot region to be on the target device, and I just put acc kernels there, without doing anything else. You may need to worry about some data first, but otherwise even the data movement is implicit, as long as the compiler can manage to work out the data for you. Otherwise, if it's not safe, it won't; and actually, for the C code, I got a runtime error.
So basically, with kernels, even if there are multiple loops in the region, the compiler will try to generate multiple kernels for it, because it treats this as a whole region and will do whatever is safe, whatever it can optimize for you. Parallel is basically the more explicit way: the programmer tells the compiler, hey, I want you to do the offload and the parallelization, and here are the loops I want you to work on. You do a lot of it manually. If you don't put a loop directive, sometimes some compilers will parallelize the loop for you anyway.
Sometimes I even found some compilers adding the reduction for you if you don't say it, but that's not something to rely on: you should always add it as needed. If you run it with another compiler, it will fail for you because you forgot to do that. With kernels, your correctness is guaranteed, but usually kernels is not as performant as the hand-tuned versions. So, as a start, it's an easy way to go.
Okay, so that's the example. I didn't use the update directive in that example, but sometimes in your code you might do something on the device, and then you want another device region, but in between, on the host, you want to do some exchange. That's what update is for: there's self and device for OpenACC, and from and to for OpenMP, meaning whichever direction is needed.
Now let's shift gears. Having introduced all these things, let's look at what OpenACC is, what OpenMP is, how their communities are, what compilers are available, and all these other things. I think what I showed is that they're actually pretty similar in how you add them to your code, right? Adding the directives is not hard; it's more about where to add them and what to add.
OpenMP has a big committee behind it, and OpenMP supports lots of compilers and lots of architectures as well. For OpenACC: I think when Titan was around, it started because we wanted something quicker that could perform well on the GPU. So OpenACC was introduced in 2011, and then the pattern was: OpenACC would get features first, and then OpenMP would say, hey, this is good, let's take it.
So here are the features: 4.0 started to do GPU targets; 4.5 was a major refinement with more target features; and 5.0 has more features, like the aforementioned loop construct and unified memory, and these things. So OpenMP is getting more and more mature as well, but OpenACC is still the more mature GPU programming model at this point, especially for the NVIDIA GPUs.
And here's the OpenACC resources page. Compiler-wise, on Perlmutter we will have PGI, and GCC (though GCC performance is not that good); Cray deprecated its OpenACC support in recent CCE versions. On the other DOE systems it's probably still those, because those are the bigger commercial and non-commercial OpenACC compilers available.
For OpenMP, again, we do have our web page list; the list is very long (only about one third of it is shown here). For Perlmutter it will be PGI because, as I think Jack mentioned, we have an NRE with PGI to develop the OpenMP offload support for the Perlmutter GPUs. The timeline is that when Perlmutter is here, we will have an officially released compiler version. So we will have PGI, and we expect that PGI will leverage their OpenACC implementation expertise.
So this should be a good compiler. CCE is now focused on OpenMP rather than OpenACC, and Clang is also only for OpenMP, not for OpenACC; GCC (I put GCC twice); and Flang is also part of the big LLVM community development effort, involving, I think, NVIDIA/PGI, Cray, and Intel. And then for the other DOE ecosystems, it's basically the same, but I added IBM and AMD; these are for the other DOE labs' bigger systems.
Summit and Frontier here at Oak Ridge, and Aurora at Argonne with Intel; and IBM is the path for Summit. So there are more OpenMP compilers available. This slide's concept is from the recent DOE OpenMP BOF: we need to make sure we have a portable solution that can target different platforms across vendors. So here's a list, timeline-wise, of compilers that support OpenMP; I mentioned them all earlier.
Here I want to show you a conversion list; it's actually almost a one-to-one conversion. So at this point I'd say conversion is not hard if you already have OpenACC; it's more a question of how your code maps to which compiler and which implementation scheme, and whether the maturity of that compiler gives you good performance. I don't need to read it all: basically, gang, worker, and vector map to distribute, parallel for, and simd, and then there are the data clauses and the rest.
Okay, I showed this. Now another shift of gears: I want to show you some of the existing community efforts, people comparing and porting to OpenMP and the performance they're seeing. I think the biggest outcome from this is that people are discovering lots of compiler bugs in OpenMP implementations, or finding things missing that they wanted to see. They then ask the OpenMP community and the implementations to fix them, or to include them in the next specification; these kinds of feature requests.
This is one of the comparisons shown by the Oak Ridge people. They took actual major benchmarks and compared OpenACC and OpenMP. The green ones are where OpenMP is better; the blue ones are where OpenACC is better. These are results from 2018, so it's not the most up to date, but still, at that point, with the then-current versions of OpenMP 4.5 compilers.
I think I already mentioned this loop construct. It brings OpenMP closer to OpenACC, but it also makes sure your loop iterations are executed exactly once: not like some other combinations where, in a nested loop, an iteration might actually run more than one time, or at least once; with this one it is exactly once.
For the last few slides, I just want to talk about best practices and some recommendations for OpenACC and OpenMP offload; basically, these apply to both of them. You want to give the GPU enough work to do, and the compiler feedback can give you lots of information to find out what is missing and why things are not parallelized or vectorized; all that information helps you. Several of the compilers give good hints.
Collapse is good, because you can enlarge the iteration space; and there are the clauses for OpenACC such as vector length, for example 32 to match a warp of 32. And lastly, try out different compilers: I was surprised to see the CCE compiler performing so well for that code. Try different compilers, and, if it's not hard to do, you can try both OpenMP and OpenACC as well.
So if your code already has OpenACC, especially coming from the Oak Ridge side, it's fine: Perlmutter has PGI OpenACC, and you can continue to use it; maybe at some point, when you want to port it to other DOE machines, you can probably transfer it to OpenMP. But if you're starting from new, if you're a new NERSC user, then just for continuity purposes OpenMP is probably good: we have already been using OpenMP at NERSC for a long time, and with the PGI OpenMP offload coming along.
It's natural that you could continue. Performance-wise, from what we looked at, it's on par with or catching up to OpenACC, depending on which compiler implementation you're looking at: IBM seems to do pretty well, and CCE in some cases as well, I think. But all these OpenMP compiler implementations are actively evolving, fixing things, and starting to implement the 5.0 standard, so you will see more and more higher-quality versions in the future.