From YouTube: Intro to CUDA programming
Description
Dossay Oryspayev (LBNL), Muaaz Awan (LBNL), Hugo Brunie (LBNL), & Michael Rowan (LBNL) present a tutorial on Intro to CUDA programming.
A: The hardware scheduler schedules warps of threads, that is, chunks of 32 threads, onto the hardware. As we saw in yesterday morning's talk in the first session, unlike a CPU, a GPU is a latency-hiding device, and to optimally exploit the massively parallel architecture of a GPU, it needs to hide latency. To better understand what latency hiding is, let's have a look at this figure. Let's assume that at cycle one our device had three available warps that could be launched.
A: The scheduler will pick any of them and launch it. Now let's say the first warp is picked and launched, and it makes a memory request which is going to take two cycles to process. That means that in the next cycle that warp won't be able to progress, so what the device does is pick up another available warp and launch that, and the process continues until the memory request from the first warp is served and it is able to go forward again, now in cycle four.
A: If we had another available warp at that point, it would be kind of a random pick between warp one, which is also ready to go forward, and that other available warp. So what we can take away from this figure and the concept of hiding latency is that the more work the device has available, the better it will be able to hide latency. Now imagine if, after cycle one, we did not have another warp available for the next two cycles: your device and your resources would have been sitting idle. So, to make better use of resources, you want as many warps resident as possible.
A: The device supports a maximum of 64 resident warps per SM, which translates to 2048 resident threads per SM, because each warp consists of 32 threads, and if you take the product of 32 and 64 you get 2048. But there is another limit, the maximum number of blocks: the way these threads are divided across blocks is up to you, but the number of concurrent blocks per SM cannot be more than 32 for a V100 device.
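To make that arithmetic concrete, here is a minimal sketch (not from the tutorial repo; the kernel name and block size are illustrative) of how those per-SM limits constrain a launch configuration on a V100-class device:

```cuda
#include <cstdio>

__global__ void dummyKernel() { }

int main() {
    // Per-SM limits quoted above (V100-class device, assumed):
    const int maxWarpsPerSM   = 64;
    const int threadsPerWarp  = 32;
    const int maxThreadsPerSM = maxWarpsPerSM * threadsPerWarp; // 64 * 32 = 2048
    const int maxBlocksPerSM  = 32;                             // concurrent blocks

    // How the 2048 threads are divided across blocks is up to you, but with at
    // most 32 resident blocks, each block needs at least 2048 / 32 = 64 threads
    // for the SM to reach its full complement of resident threads.
    const int threadsPerBlock = 128;
    printf("%d blocks of %d threads fill one SM (block limit: %d)\n",
           maxThreadsPerSM / threadsPerBlock, threadsPerBlock, maxBlocksPerSM);

    dummyKernel<<<maxThreadsPerSM / threadsPerBlock, threadsPerBlock>>>();
    cudaDeviceSynchronize();
    return 0;
}
```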
A: So that was all about kernel configuration, and you will experience these concepts in the exercises that we have for you. To top this off, let's add another concept: memory coalescing, and this is one of the more important ones.
A: So, global memory accesses from the device are serviced in the form of memory transactions of size 32 bytes. That means that even if your thread is going to access a four-byte number, that is, an integer, whenever it accesses memory the access will be processed in the form of a 32-byte transaction.
A: A good programming practice is therefore to ensure that consecutive threads inside a warp access memory locations which are close to each other, contiguous memory locations; that is going to bundle up all the memory accesses of that warp into the least number of memory transactions possible.
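As a sketch of the difference (these kernels are illustrative, not the exercise code), compare a coalesced access pattern with a strided one:

```cuda
// Coalesced: consecutive threads of a warp read consecutive 4-byte floats, so
// the warp's 32 loads are served by a handful of 32-byte transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are `stride` elements apart, so each load can
// fall into its own 32-byte transaction and most of each transaction is wasted.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```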
A: So the top figure is what you want to do, and the bottom figure is what you don't want to do. To try these concepts out, you might want to move into the section two folder in the GitHub repo that you have and open up the README file. The README contains all the details of the exercises: how to build and run, what parameters to observe, and how to change them.
A: The first exercise is the vecAdd kernel, and the kernel for the next exercise, the memory exercise, is the vecAdd memory kernel. You can build the vecAdd file with a simple make command, and you can run it using sh run.script. For the second exercise, the memory exercise, you might want to use the other script, transactions.sh. The details about when to use which script and how to build are all in the README file.
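The build-and-run sequence described here would look roughly like this (folder and script names as given in the talk; check the README for the exact layout):

```sh
cd section2            # the section two folder of the tutorial repo (name assumed)
make                   # build the exercises
sh run.script          # run the first (vecAdd) exercise
sh transactions.sh     # run the memory-coalescing exercise
```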
A: So I will be here if you have any questions; you have about the next 25 minutes for this.
B: So session three is about debugging a CUDA code, because, yeah, we all know that coding is not just writing some lines of code and then getting great speedup; it is also about debugging the program, which always happens.
B: So what are the tools to debug? Actually, Jonathan already gave a presentation about this this afternoon, so if you remember it, this will be very easy for you.
B: Basically, printf is the one you go to when you don't have other possibilities, or when you think you can debug something very, very quickly. You do need to be careful on CUDA, though, because when you put a printf in a kernel, it's not one, two or eight threads that execute the printf; it can be several hundreds or several thousands of threads, so your terminal can be overwhelmed.
B: Yeah, there can be too many outputs in the terminal. So what you basically do is just put a conditional branch on the thread ID; here, in a one-dimensional kernel with one-dimensional blocks, we pick the master thread.
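A minimal sketch of that guard (kernel and arguments are illustrative): without the condition, every thread would print; with it, only the first thread does.

```cuda
#include <cstdio>

__global__ void myKernel(const float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard the printf so only the "master" thread of a 1-D kernel with
    // 1-D blocks produces output, instead of thousands of threads at once.
    if (tid == 0) {
        printf("n = %d, data[0] = %f\n", n, data[0]);
    }
}
```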
B: Another way to debug is to use cuda-gdb. It's very similar to the gdb you are used to using on a classic CPU: bt for the backtrace, for example.
B: We won't go into the details of them, but know that they exist, and if you want to be sure your program is valid, like when you use valgrind on a CPU code, you can use this tool on the CUDA code; and TotalView is, to me, the equivalent of DDT.
B: You will use the -G flag (hyphen, capital G) or -lineinfo to get the line information for the debugger, and -rdynamic gives you the symbol information on the CPU side; you must put -Xcompiler before the -rdynamic to tell nvcc that this option is for the host compiler. How will you use this cuda-gdb? You will use it by putting breakpoints, for example with b.
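Put together, a debug build might look like this (file names are illustrative):

```sh
# -G compiles device code with debug information (use -lineinfo instead for a
# lighter build that only maps instructions to source lines); -Xcompiler
# forwards -rdynamic to the host compiler for CPU-side symbol information.
nvcc -G -Xcompiler -rdynamic -o myprog myfunc.cu
```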
B: b myfunc.cu:48 means I put a breakpoint at line 48 of the file myfunc.cu, and then I run the program with run and it will stop at this breakpoint if it reaches it. You can print values: p var, or p array[0]@10.
B: That prints the first 10 elements of the array. You can control the execution, like I said, with run, next, step and continue. One thing to note is that there are no watchpoints possible in cuda-gdb, so, yeah, if you were used to using a watchpoint in CPU gdb, you cannot here. The changing of context is also a bit more complex than on CPU: when you want to change threads on CPU, it is straightforward.
B: You just give the ID of the thread. Here you have a three- to four-dimensional ID, so you can specify the ID of the thread by its device, SM, warp and lane; this is the hardware way of giving the coordinates of the thread. And there is a software way of doing it with block and thread; you can notice here that block and thread are each triples of integers, because these can be three-dimensional.
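A sketch of such a session (program name and coordinates are illustrative; the commands themselves are standard cuda-gdb):

```
$ cuda-gdb ./myprog
(cuda-gdb) b myfunc.cu:48             # breakpoint at line 48 of myfunc.cu
(cuda-gdb) run                        # stops at the breakpoint if it is reached
(cuda-gdb) bt                         # backtrace, as in plain gdb
(cuda-gdb) p var                      # print a scalar
(cuda-gdb) p array[0]@10              # print the first 10 elements of an array
(cuda-gdb) cuda device sm warp lane   # show the current focus, the hardware way
(cuda-gdb) cuda block (1,0,0) thread (3,0,0)   # switch focus, the software way
```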
B: To execute, you will obviously need to have allocated a GPU node; it must be already done. Then you srun your code; don't forget to add the --pty, which allows you to execute cuda-gdb interactively, and then --args if you have arguments for your program; if your program doesn't have arguments, you are not obliged to add --args.
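The launch line would look something like this (the executable name and arguments are placeholders):

```sh
# --pty gives an interactive terminal for the debugger; everything after
# --args is the program and its arguments (omit --args if there are none)
srun --pty cuda-gdb --args ./debug_printf arg1 arg2
```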
B: In this session you will have two files: debug_printf.cu and the other one, debug_memcheck.cu.
B: You will go through them one by one. You will have to modify these files, even if just a bit; the .hpp you don't have to modify. To compile, just use make. First you run the code with srun on the first file, and you will get the results on the left, with the correctness test that fails; your goal is to get to the result on the right, where the correctness test passes. For that, you can help yourself with what is printed on the output by the printf.
B: That's why we are saying that we're debugging with printf: the idea is to understand what is going on here and the way the test is failing.
B: Other commands you will have to use with cuda-gdb are, for example, setting a breakpoint with b and the name of a function.
B: I thank [name unclear], who made some very similar slides on debugging on GPU in February 2020. So now you can go ahead and start the exercise, and if you have any questions, we're here. Good.
C: Yeah, it's good. All right, so welcome, everyone. This is session four of the CUDA tutorial; we'll be introducing here some of the NVIDIA profiling tools that you've heard about in many of the talks today and yesterday.
C: So we'll just start by throwing up this diagram that you've seen a couple of times already; this is another incarnation of the optimization workflow diagram.
C: You might follow a process like this: you profile your application and collect some data, then you analyze this data and try to identify bottlenecks, or kernels that aren't behaving as you would like them to, and then you tweak things in your kernel and see whether the things you've changed are actually changing the application behavior in the desired way.
C: Two tools can really help you with these two steps, profiling your application to collect data and then analyzing it: Nsight Systems and Nsight Compute. The tools have somewhat different scopes: Nsight Systems can give you a cohesive picture of how your application is interacting with the various system resources available to it, while Nsight Compute is more of a targeted analysis tool that can tell you about the performance of individual kernels.
C: So first, this is a very, very quick introduction to Nsight Systems; we'll say more about these various timelines in a hands-on demonstration in a moment, but the three main timelines here are the CPU workload timeline, the OS threads timeline and the device timeline.
C: All right, I'm going to keep this view just so that you can actually see where my cursor is. So there's a timeline here that shows you the workload on your CPU cores; I'm pointing to this very thin black line.
C: In the GUI you can expand this range so you can see it more clearly, but this shows you the CPU utilization. There's another timeline that shows you how the OS threads of your application are interacting with the CPU resources and also with the GPU, and the different lines here indicate different things: the black line is telling you the CPU core utilization, and then there's the line below it; it's a little small here.
C: Let me zoom in. So there's a red bar here; if you hover over it in the application, you'll see a tooltip that will tell you which core it actually corresponds to. So the core utilization is in black, this red bar corresponds to a particular core, and below it you see the thread state: whether it's active or scheduled or stalled, things of that nature. This thread timeline in Nsight Systems has support for several APIs.
C: So you can trace commands from the CUDA API or OpenACC and several others; I don't have the full list, but there are many others that are supported. For this tutorial we're using CUDA, so you'll be able to see how these commands from the API, like the cudaDeviceSynchronize which you see in this example, are called from one of your threads.
C: The final timeline here is the device timeline, and this tells you about memory operations and the compute workload on the GPU.
C: For this example we have a Tesla V100, and there's some blue here, the height of which tells you the kernel coverage over a given time. So that is a basic orientation for Nsight Systems. One really useful feature is that you can click on things from one timeline: say, from the device row you could click on this blue bar, which again corresponds to a kernel that's running on the GPU, and then you can see where the kernel launch was called in one of your threads.
C: So here you can see additional information about the launched kernel, like the begin time, the end time and the stream that it ran on.
C: Also, the streams are available in this drop-down menu under the device row, and this sort of analysis is useful because it tells you the latency between when you launch the kernel and when the kernel is actually running on the GPU.
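That latency is visible because kernel launches are asynchronous on the host side; here is a minimal sketch (names illustrative) of the pattern whose two halves you see correlated in the timeline:

```cuda
__global__ void work(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void step(float* d_x, int n) {
    // The launch call returns immediately; the kernel starts on the GPU some
    // time later, and that gap is the launch latency seen in the timeline.
    work<<<(n + 255) / 256, 256>>>(d_x, n);
    // cudaDeviceSynchronize blocks the CPU thread until the kernel finishes,
    // which is why it appears as a long range on the OS-thread row.
    cudaDeviceSynchronize();
}
```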
C: So that is Nsight Systems. And then this is a screenshot from Nsight Compute.
C: There's this bar chart that can tell you about the SM utilization and also the memory bandwidth utilization, and these values are phrased in terms of something called a "speed of light" value. Exactly what that means is hardware-dependent, but it's meant to indicate a fraction of the peak performance on that hardware. So this SM utilization, I guess, looks like it's around three percent or so.
C: That means you're doing three percent of the compute workload that would be possible on this GPU if the hardware were being used at peak capacity, and there's a similar meaning for the memory bandwidth here. Also, if you were to click this "apply rules" here, it will automatically generate some tips about possible performance bottlenecks that are recognized according to some heuristics; it can tell you things that you should take a closer look at to improve the performance. And this is actually a really realistic case of what you might see if you're looking at an application and you just open up some random kernel: it's quite common that your kernel is neither memory-bandwidth bound nor compute bound, and in that case it might mean that it's latency bound, and this encourages you to look into further issues.
C: So I'm going to switch to a hands-on demonstration; we'll open up a few reports and just poke around them a little bit. First, I'm not on a GPU right now, but I'll just show you the commands that you could use to generate a report; we'll start with Nsight Systems.
C: You can specify stats=true; this is going to output some profiling statistics to the command line. It will also generate an SQLite database with all of the profiling information, if you wanted to use that; I've never used it before, but it's generated there. And then, lastly, you just select the application that you want to profile.
C: For example, you can try profiling the vector addition kernel that's been used in the other sessions. There are additional options here: you can specify a delay and a duration in seconds with -y and -d respectively. So if you say the delay is one, then it's going to wait one second before it starts profiling, and if you specify five for the duration, then it's going to collect data for five seconds.
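Putting those options together, a collection command might look like this (report and binary names are placeholders):

```sh
# --stats=true prints summary statistics and leaves an SQLite database next to
# the report; -y is the delay and -d the duration, both in seconds
srun nsys profile --stats=true -y 1 -d 5 -o my_report ./vecAdd
```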
C: Okay, so we see it already; let me expand this.
C: You can see the timelines that I pointed out before: there's the CPU cores workload timeline, there's the OS threads timeline, and then there's this device timeline. If we just expand out this CPU cores workload, we see that this CPU 59 seems to be doing a lot of work, and it stops doing work for some reason around 1.47 seconds or so. Okay, so that's the CPU workload timeline.
C: As I said before, there are these OS threads, and it shows you all the calls for the various APIs. So we see calls from the CUDA API, like there's a cudaMalloc here, and you can click on this and there will be highlighted ranges here that show the correlated calls, and the beginning and end times for the thing that's actually executed on the GPU.
C: All right, well, okay, sorry; what I mean to say is that we could click on a kernel here from the GPU row. Let me zoom in a little bit.
C: We could click on a kernel that's running here, and then we can see that it's actually launched at this time from the CPU, so we'll zoom in a little bit more. We click on this; again, this blue bar down on the device row is showing a kernel that's executing on the GPU, and it's launched at this point. Okay, so we can see, for example, that there's a launch latency between the two.
C: Now we're going to switch to Nsight Compute. Again, we'll just show the command that you can use to generate a report with Nsight Compute. So again, if I were logged into a GPU node, then you could do srun and then nv-nsight-cu-cli.
C: We can generate a report called cu_profile; then you can specify the kernel name with -k. This is the kernel for which you want to collect some profiling metrics, and in the case of the vector addition exercise this would be the vecAdd kernel. If you're using this on some more complicated application, you may have a very mangled name, so you might want to use regex syntax to identify the kernel. And then the last thing to specify is the application to profile.
C: So you can use this command, and then it will generate a report.
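So the full command might look like this (using the CLI name from the talk; newer CUDA toolkits ship the same tool as ncu, and the kernel and binary names here are placeholders):

```sh
# -o names the report, -k filters the kernel to profile (regex helps with
# mangled C++ names); the last argument is the application itself
srun nv-nsight-cu-cli -o cu_profile -k vecAdd ./vecAdd
```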
C: Okay, so again, these are the measures I mentioned before: this is showing the SM usage and the memory bandwidth usage.
C: One pretty neat feature here, let's see: if you go to this page drop-down, you can click on "Source" and you can see the number of live registers per line of the low-level SASS code, which is kind of cool.
C: You can also see the speed-of-light values for different pipelines here. So if we just poke around these lists, we can see that the speed-of-light value for this pipe, FP64 cycles active, is zero, so it means the FP64 pipe is not being used at all in this example.
C: Yeah, so there's a lot of information here, and I would just recommend everyone take a look at the official NVIDIA documentation, because we've only provided a very brief introduction to these tools, and there are also lots of really good tutorials, like the ones from the Blue Waters workshops.
C: So there are tutorials on Nsight Systems and Nsight Compute, and I think, yeah, that's all I have on these profiling tools. So I'd recommend, in the remaining time, that you continue working on any examples from the previous sessions, or try out some different options using the command-line interface for Nsight Systems; try profiling the vecAdd kernel, for example. Another option is to use the NVIDIA Tools Extension (NVTX) to add some timelines of your own.
C: I think you can see this in the command-line interface too; there should be some statistics output if you use NVTX to instrument your application.
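A minimal sketch of NVTX instrumentation (the header location varies by toolkit: older toolkits use <nvToolsExt.h> and need -lnvToolsExt at link time; the names here are illustrative):

```cuda
#include <nvtx3/nvToolsExt.h>

__global__ void work(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void step(float* d_x, int n) {
    nvtxRangePushA("step");                 // open a named range
    work<<<(n + 255) / 256, 256>>>(d_x, n); // work covered by the range
    cudaDeviceSynchronize();
    nvtxRangePop();                         // close it; Nsight Systems shows
                                            // "step" as its own timeline row
}
```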