National Energy Research Scientific Computing Center (NERSC) Introduction to GPU Training, February 2020, 14 Mar 2020

From YouTube: Intro to GPU: 06 Debugging on GPU

A

Hi everyone, welcome back to the afternoon session. My name is Hasan Yang. I work in the NERSC User Engagement Group, and I'm going to talk about debugging on GPUs. Maybe I'm severely underestimating the number of threads here, but I can only count up to a thousand.

A

The problem here is that we are running a lot of threads, so it is very difficult to know who is doing what. When an error occurs, where does it occur? That is very difficult to tell. The good old way of using print statements is not going to work here. I used print statements a lot for my thesis project, but that won't work in this case, so you have to use debugging tools.

A

These are the tools that I'm going to cover today. cuda-gdb is a kind of gdb; it is an extended version for CUDA. Then there is cuda-memcheck, which is reminiscent of Valgrind's memcheck. And we have a pretty good GUI parallel debugger called TotalView. We have another popular tool called DDT, but unfortunately we don't have a license for GPU. So for the time being, we have to use these three.

A

So what is cuda-gdb? It is an extension of GNU gdb for debugging CUDA code. You can use it to debug both CPU and GPU code within the same application. It is a command-line tool, and it is basically for debugging non-MPI code, but if you want, you can try this kind of trick to use it with a small number of MPI ranks. One thing you should notice is that when you refer to a CUDA entity, you put the word cuda in front of the command.

A

For example, with the command cuda thread 170 you switch from whatever thread you are on to thread 170. That is the main difference between GNU gdb and cuda-gdb. A very good resource for learning cuda-gdb is the user's manual provided by NVIDIA, so do module load cuda and then go there.

A

Get this PDF file; it is very good. So what do you do with cuda-gdb? You set breakpoints. By the way, watchpoints are not supported for CUDA code. Do you know the difference between watchpoints and breakpoints? A breakpoint is a place where the code should stop: before you run, you preset where the code should stop so that you can check variable values. With a watchpoint, the program stops when the value of a certain variable changes.

A

You set watchpoints on certain variables, and it is a pretty useful tool for debugging memory corruption issues. But for CUDA, probably because it would take too many resources, they don't support watchpoints. Anyway, you can set breakpoints, run the code or continue, and when the code stops, you can check the values of variables or the status of the program.

A

That is the main debugging workflow: set the breakpoints, run the code, and when the code stops, examine it. Another thing worth noting is that you can run cuda-memcheck, the second debugging tool that I'm going to talk about this afternoon, under cuda-gdb. Another nice thing that I found is autostep in cuda-gdb.

A

What autostep does is this: because we are dealing with so many threads, we can specify a certain suspicious area in the code, say three lines or so. In those particular lines, cuda-gdb examines the execution very closely by single-stepping, while the rest of the code runs fast. So when the code stops there, we can see where the code fails at warp-level resolution. This is really powerful.

A

I think this is a very interesting feature. Another thing is that you may want to generate a core dump, because you want to examine where your code fails. I think the most important thing in debugging, when you have a bug, is to know where the code fails.

A

Once you find that out, you have solved half the problem, and then you can do a lot of things, such as putting print statements there. But with a core dump you can quickly identify where the code fails, and then you can check the variable values. So this is one way to get your core dump. So how do you run cuda-gdb? You need to build with the -g -G flags to get debug information on the CPU side as well as the GPU side.

A

If you use PGI for CUDA Fortran, you do it that way. To start, you load the module and run the srun command. Please don't forget to add --pty, because this is really necessary to run the cuda-gdb commands interactively. Another concept here is the kernel focus. As I said, we are running so many threads, but I can only examine one thread at a time, so I focus on one thread.

A

The focus is on whichever thread I'm on at the time, and if I suspect that the problem is with another thread, I can switch the focus to that thread. So I was examining thread 0, but at a later point I can change it to whatever thread I want.

A

So what do you need to do to go to a different thread? You can use either hardware coordinates or software coordinates. Hardware coordinates start with the device. We have eight GPUs on a single node, but if we are using one GPU, then you have just one device. On Volta we have 80 SMs, streaming multiprocessors, and within an SM we have warps. A warp is a group of 32 threads, and a lane is one of these 32 threads.
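The relationship between a thread index and its warp and lane can be sketched in plain Python. This is a host-side model only, not CUDA code, and the function name `warp_and_lane` is made up for illustration. It assumes the usual CUDA linearization (x varies fastest) and a warp size of 32. The warp number it returns is the warp's position within the block; the hardware warp slot that a debugger prints is assigned by the GPU and may differ when several blocks share an SM.

```python
WARP_SIZE = 32  # threads per warp, as stated in the talk


def warp_and_lane(thread_idx, block_dim):
    """Map a 3-D thread index inside a block to (warp-in-block, lane)."""
    x, y, z = thread_idx
    dim_x, dim_y, _dim_z = block_dim
    # CUDA linearizes threads with x varying fastest, then y, then z.
    linear = x + y * dim_x + z * dim_x * dim_y
    return linear // WARP_SIZE, linear % WARP_SIZE


# Thread 170 of a one-dimensional block of 256 threads.
print(warp_and_lane((170, 0, 0), (256, 1, 1)))  # (5, 10)
```

In a 16 x 16 block, thread (0, 1, 0) has linear index 16, so it sits in lane 16 of the first warp even though its x index is 0.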

A

Software coordinates are, I'm sorry, I think I made a mistake here. Software coordinates are the kernel I'm running and the grid, block, and thread, the basic entities you know from CUDA programming. To find out which thread I'm on, I can just query it. With cuda device sm warp lane I get the hardware coordinates, so I'm on device 0, SM 0, warp 0, lane 0. With cuda kernel block thread, it prints the software coordinates.

A

It prints the software coordinates corresponding to that thread. If I want to switch to a different thread, I can say that I want to go to device 0, SM 1, warp 2, lane 3, and then, boom, I'm on that particular thread. You can go to chapter 11.1 of the manual, where there are some examples. It is a pretty good example. The bitreverse code works on 8-bit words: you reverse the order of the bits in each byte.
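What the example computes can be sketched on the host in plain Python. This is an illustration of reversing the bits inside each byte, as described above, and not a copy of the manual's kernel; the mask-and-shift approach and both function names are assumptions chosen for the sketch.

```python
def bitreverse_bytes(word):
    """Reverse the bit order inside each byte of a 32-bit word."""
    word &= 0xFFFFFFFF
    # Swap the two nibbles of every byte.
    word = ((word & 0xF0F0F0F0) >> 4) | ((word & 0x0F0F0F0F) << 4)
    # Swap the bit pairs inside every nibble.
    word = ((word & 0xCCCCCCCC) >> 2) | ((word & 0x33333333) << 2)
    # Swap neighbouring bits.
    word = ((word & 0xAAAAAAAA) >> 1) | ((word & 0x55555555) << 1)
    return word


def bitreverse_bytes_reference(word):
    """Slow but obvious version, used only to cross-check the result."""
    out = 0
    for byte_pos in range(4):
        byte = (word >> (8 * byte_pos)) & 0xFF
        reversed_byte = int(format(byte, "08b")[::-1], 2)
        out |= reversed_byte << (8 * byte_pos)
    return out


print(hex(bitreverse_bytes(0x01020304)))  # 0x8040c020
```

Having a slow reference next to the fast version is handy when stepping through a kernel in a debugger: the values printed for a given thread can be compared against what the reference gives for the same input.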

A

So it is pretty simple, and you can just follow these steps: module load cuda, build the code, and run it under the debugger. Set a breakpoint in the main function and set a breakpoint at the kernel, whose name is bitreverse. You can also set a breakpoint at line 21. Run it, and it will stop at the first breakpoint. Then you can examine certain things and continue. Sometimes you can forget where you are.

A

If you forget which thread you are dealing with, you can type info cuda threads, and it tells you that you are at block (0,0,0), thread (0,0,0), and that the block holds 256 threads, numbered 0 through 255. So from here you can get a lot of information about this code.

A

The backtrace shows the call stack from the GPU side, because I'm in the kernel. You can query information about the kernel itself like that, and you can print the block ID. You can print a lot of program-related entities: block index, thread index, grid dimension. If you want to go to the next line, just type next, and so on. You can print array values and make sure that they are reasonable.

A

If you see that something is wrong, then you need to go back from that moment on to see why you are getting the wrong values. This is a parameter of the kernel. If you print it, it shows that the parameter is basically a pointer, printed as an address value, so you can dereference it to see the values. Again, you can switch to thread 170 for whatever reason, do something there, and then quit. So this is a pretty typical session.

A

This code does not have an error, but you can try it today. Another example is autostep. I said that autostep is a pretty useful tool to me, but the example code there doesn't seem to work for whatever reason. Anyway, it clearly demonstrates how useful autostep can be: you don't know where the error in the code is, but you set a range of code where the code will run slowly, and then you find where it stops.

A

When the code stops, it prints where the code failed: lane, warp, device, and so on, so that you can narrow it down. The second tool that I want to talk about is cuda-memcheck. This is something similar to Valgrind's memcheck.

A

Just like Valgrind, it is made up of several tools. The one that is probably the same as in Valgrind is memcheck, which detects memory errors. racecheck can be pretty useful if your code has a race condition between threads; it can detect it. initcheck is pretty minor stuff, because it only detects uninitialized variables. synccheck detects synchronization errors. Again, the manual is pretty useful.

A

To build, you follow these steps. The first tool, memcheck, detects memory access errors and malloc/free errors: double free, invalid pointer passed to free, heap corruption, and so on. Besides that, it also detects a collection of other errors, such as hardware exceptions and CUDA API errors. Another important thing is memory leaks: you allocate memory but forget to deallocate it when you get out of the kernel, for instance.

A

If you keep doing that, you will be losing a lot of memory, and your program will probably crash eventually because no more memory will be left. So it does memory leak detection. To run this tool, load the module and run the cuda-memcheck command with some memcheck options.
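The kinds of allocation errors listed above (leaks, double frees, and frees of invalid pointers) can be illustrated with a toy allocation tracker in plain Python. This is a conceptual model only: the class `AllocationTracker` and its methods are hypothetical, and this is not how cuda-memcheck is implemented or invoked.

```python
class AllocationTracker:
    """Toy bookkeeping of allocations; not how cuda-memcheck is implemented."""

    def __init__(self):
        self.live = {}        # address -> size of blocks not yet freed
        self.freed = set()    # addresses that were freed at least once
        self.errors = []      # (kind, address) tuples in the order detected
        self._next_address = 0x1000

    def malloc(self, size):
        address = self._next_address
        self._next_address += size
        self.live[address] = size
        return address

    def free(self, address):
        if address in self.live:
            del self.live[address]
            self.freed.add(address)
        elif address in self.freed:
            self.errors.append(("double free", address))
        else:
            self.errors.append(("invalid pointer to free", address))

    def leaked_bytes(self):
        """Bytes still allocated, i.e. what a leak check would report."""
        return sum(self.live.values())


tracker = AllocationTracker()
a = tracker.malloc(256)
b = tracker.malloc(1024)
tracker.free(a)
tracker.free(a)        # freeing the same block again
tracker.free(0xDEAD)   # never returned by malloc
print(tracker.errors)
print(tracker.leaked_bytes())  # b was never freed
```

The point of the model is that a leak is only visible at the end of the run, when the live table is still non-empty, which is why leak checking is a separate option rather than something reported at each call.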

A

For instance, if you want to detect memory leaks, add this flag here. As I said, the memcheck tool can be run inside cuda-gdb, but there is one caveat: if we run memcheck in cuda-gdb, kernel launches become synchronous. You know that when the host launches a kernel, the launch should be non-blocking, right?

A

A data transfer may be synchronous, but a kernel launch normally is not. If we use memcheck this way, the launch becomes blocking, so be aware of that. There are pretty good examples here, two examples, and you can just follow their steps.

A

Yes, you can do memcheck there. If you run memcheck in cuda-gdb, the kernel launch will be blocking with respect to the host, the CPU side.

A

So, racecheck. As I said, it detects race conditions, but currently it only supports shared memory, meaning the fast on-chip memory. So it detects race conditions among variables in shared memory and nothing else. To run it, you run this command.

A

It produces two types of report. One reports the individual race conditions, and the second, the analysis report, is based on the race conditions detected and kind of summarizes the race conditions in the code. Again, try that.
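The idea of reporting individual conflicts and then a summary can be sketched with a toy model in plain Python. It assumes a log of shared-memory accesses recorded between two barriers and flags pairs where different threads touch the same address and at least one of them writes. The functions `find_hazards` and `summarize` and the log format are hypothetical and do not reflect how racecheck works internally or what its output looks like.

```python
from collections import defaultdict


def find_hazards(accesses):
    """Report conflicting accesses between two barriers.

    `accesses` is a list of (thread_id, address, op) with op "R" or "W".
    Two accesses conflict when different threads touch the same address
    and at least one of them writes.
    """
    by_address = defaultdict(list)
    for thread_id, address, op in accesses:
        by_address[address].append((thread_id, op))

    hazards = []
    for address in sorted(by_address):
        entries = by_address[address]
        for i, (t1, op1) in enumerate(entries):
            for t2, op2 in entries[i + 1:]:
                if t1 != t2 and "W" in (op1, op2):
                    hazards.append((address, t1, op1, t2, op2))
    return hazards


def summarize(hazards):
    """Collapse individual hazards into one count per address."""
    summary = defaultdict(int)
    for address, *_ in hazards:
        summary[address] += 1
    return dict(summary)


# Thread 0 writes slot 0 while thread 1 reads it with no barrier in between;
# slot 1 is only read, so it is safe.
log = [(0, 0, "W"), (1, 0, "R"), (0, 1, "R"), (1, 1, "R")]
hazards = find_hazards(log)
print(hazards)
print(summarize(hazards))
```

Two reads of the same address never conflict, and neither do accesses by a single thread, which is why only slot 0 is reported here.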

A

Then initcheck. As I said, if we try to use variables without initializing them first, it detects that, but this only works for variables in global memory.

A

Global memory only. So, for instance, this will not work for a shared memory variable or a local variable. To run it, you just do it like that. synccheck detects synchronization errors between threads. Next is TotalView. This is a truly fully featured graphical parallel debugger. The manual says that it only supports the Cray compiler, so I contacted the vendor.

A

They said that it may support GCC, Clang, and others, but I need some time to check that. It also supports OpenACC. They also said that, according to the documentation, only the Cray compiler is supported there, but I think I tried it with something else anyway.

A

So it definitely supports CUDA, and it definitely supports MPI. To use the tool, you run it like that. This looks quite complicated, but it contains a lot of information in a condensed form. This is the call backtrace. This is a stack frame, and you can see the variables in that stack frame right here. You can set a breakpoint here by just clicking on the line number.

A

This is the kernel, and we are inside the kernel. This is TotalView's convention: it represents each thread or process using two numbers, like 1.something. The first number roughly corresponds to the MPI task, but not necessarily the MPI rank, and the second number roughly corresponds to the thread ID, but not exactly. Anyway, if you see a negative number in the second part, that means it is a CUDA thread. The positive ones are on the CPU side.

A

You can just click on the line number in the source pane to set a breakpoint. Before the CUDA code is loaded, the breakpoint is set only temporarily, but once the CUDA code is loaded, TotalView will relocate the breakpoint to the correct location. As I said, a host thread has a positive second number, and a CUDA thread has a negative one.
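The sign convention described above can be captured in a few lines of Python. This is only an illustration of how to read a label such as 1.-1; the function `classify_thread_id` is hypothetical and is not part of TotalView.

```python
def classify_thread_id(label):
    """Split a 'process.thread' label and say which side the thread runs on."""
    process_part, thread_part = label.split(".", 1)
    process_id, thread_id = int(process_part), int(thread_part)
    side = "CUDA thread" if thread_id < 0 else "host thread"
    return process_id, thread_id, side


for label in ("1.1", "1.-1"):
    print(label, classify_thread_id(label))
```

Both labels belong to the same process; only the sign of the second number tells you whether you are looking at the host side or at the GPU side.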

A

So this is a tricky thing, because everything is done with warp-level instructions. Okay, and here, I'm sorry, let's go back. This shows the thread coordinates: block and thread are three-dimensional entities, and this is the thread with the -1 ID. You can click this button to specify the coordinates in logical or physical terms.

A

Logical means the software coordinates in cuda-gdb terms, and physical coordinates correspond to the hardware coordinates: device, SM, and so on. To check values, you can just right-click on a variable in whatever window. This is called diving on the variable. You can check the value, you can plot array elements, and you can get statistics about array elements. So I think that's all I have today. If you play around with these, they are fairly intuitive tools.

A

So you can learn them very quickly. If you have a problem or a question, just let me know. All right.

A

Yeah I I, let it.