From YouTube: 07 - Debugging Tools
Description
Part of the NERSC New User Training on September 28, 2022.
Please see https://www.nersc.gov/users/training/events/new-user-training-sept2022/ for the training day agenda and presentation slides.
Yeah, go ahead. Okay, so I'm going to be talking about debugging. On Perlmutter it's one of the more difficult topics when you're trying to port your code, so I'm going to go over some of the tools and how you can use them with the different programming models and the different types of hardware that we have.
The most common ones we have are DDT, which is used by the majority of our users, and TotalView.
These are full-fledged GPU/CPU debuggers that support a bunch of different programming models and have a graphical user interface with a lot of different ways of doing things. There are some specific tools from NVIDIA: cuda-gdb, which is just GDB with a CUDA extension attached, and compute-sanitizer, which does a couple of different things for finding memory-related bugs. There's also gdb4hpc, which is a GDB
that's meant for doing GDB-like things but against parallel programming models, and there's valgrind4hpc, which is again similar in that it takes Valgrind and applies it to finding memory-related bugs and things like that,
but against parallel programs. There's also a special pair of related tools called STAT and ATP that are good for finding crashes and deadlocks: they look at where your program is according to its backtrace, then merge the backtraces and show you where you're going.
But before you start debugging, there are a few things you're going to want to do. You're going to want to set up a remote connection; everybody's talked about this: use NoMachine. It gives better performance than traditional X11 forwarding, although both DDT and TotalView have their own options for this.
So I put the options in here for how you want to do that with C and Fortran, and then with CUDA there are the host options, which are -g and -O0, and then the capital -G is what turns on the device debugging information for CUDA.
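As a sketch, a CUDA debug build with those flags might look like this (the file and binary names are hypothetical placeholders):

```shell
# -g/-O0 give host-side debug info with no optimization;
# -G adds device-side (CUDA kernel) debug info.
# "myapp.cu" and "myapp" are hypothetical names.
nvcc -g -O0 -G -o myapp myapp.cu
```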
You need to set up your environment so that you can create core files. You need to tell your shell that you want to be able to create core files of unlimited size, otherwise it's not going to be able to create them, and you want to tell your programming models that if they find an error, or they're going to abort, they should go ahead and dump a core file as well.
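A minimal sketch of that setup; the shell part is standard, while the programming-model variables depend on your stack (the two shown here are Cray MPICH and CUDA knobs, listed as examples, not an exhaustive or authoritative set):

```shell
# Tell the shell to allow core files of unlimited size.
ulimit -c unlimited

# Example programming-model knobs that dump state on error
# (names vary by stack; these are Cray MPICH / CUDA examples).
export MPICH_ABORT_ON_ERROR=1
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
```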
A special note on the Cray tools especially is that they use something called Cray CTI, the Common Tool Interface. This gives them common code for working with job launchers such as Slurm, and it's tied into a lot of these tools. So you need to have the module loaded, and for our particular use you need to set the environment variable CTI_WLM_IMPL, the CTI workload-manager implementation; in this case we're using slurm.
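Setting that up looks something like this (the module name shown is the one used on Cray systems; check `module avail` on your system):

```shell
# Load the Common Tool Interface and point it at Slurm.
module load cray-cti
export CTI_WLM_IMPL=slurm
```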
Here's how you allocate your nodes for debugging; everybody's talked about this as well. If you want to use a CPU node, make sure to set the constraint for CPU, same with GPU, and then you're going to want to use the interactive or debug QOS depending on how long you need the node for. And here's a link to the limits and charges that you can use for setting up the QOS on the allocations.
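A sketch of an interactive debug allocation with those constraints (the account name is a placeholder; check the linked limits page for current QOS time limits):

```shell
# One GPU node for 30 minutes in the interactive QOS.
# Replace "myaccount" with your own project account.
salloc -N 1 -C gpu -q interactive -t 30:00 -A myaccount

# For a CPU node, use the cpu constraint instead:
salloc -N 1 -C cpu -q interactive -t 30:00 -A myaccount
```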
A
First,
one
we're
going
to
talk
about
is
called
DDT.
It's
a
distributed.
Debugging
tool
supports
a
bunch
of
different
parallel
programming
models
like
MPI
openmp,
open,
ACC
Cuda.
It
supports
python,
c4trans,
C,
plus
plus.
Originally
it
was
developed
by
a
company
called
the
linets
now
owned
by
a
company
called
arm
that
develops
processors,
schematics
and
licenses
out
processor
information,
but
they
do
develop
software
as
well,
and
that's
one
of
the
reasons
that
they
picked
up
DDT.
There's some extra documentation here, but I'm just going to show you some screenshots of what DDT looks like.
A
So
once
you
open
it
up,
it
kind
of
gives
you
the
option
to
either
run
them
or
attach
to
some
kind
of
program
or
service.
You
can
open
up
core
file.
You
have,
if
you
like
as
well
or
you
can
manually
launch
the
back
end.
That's
using
some
of
the
remote
launch
stuff.
Is
there
as
well?
You can see off to the left that it breaks the file, or the source code, up into different functions. It looks like your run-of-the-mill IDE: it has line numbers, lets you check the stack on the right side, and gives you the source in the middle. On the left it's showing you the current stacks, and it gives you tabs for input, breakpoints, watchpoints,
tracepoints, and logging. It shows the processes at the top, and it gives you a bunch of buttons at the top for navigation, like stopping your program, starting your program, stepping through, things like that.
Here's one where it's also running a CUDA kernel, and you can see at the bottom that the stack has a few additional entries that allow for kernel-space, or CUDA-specific, functionality.
Here's some more specific information on how and where everything works. You've got your process and thread control up top, and navigation as we talked about before. You can right-click on a variable within the list in the middle, and that'll give you its information, plus the sparklines at the bottom. You can evaluate expressions based on whatever the current data is, and on the left you can see the stack frame.
Here are some more of the CUDA-specific features: you can see the GPU devices in the image on the right, the kernel progress on the left, like what's in progress on the device, and the CUDA stack in relation to the C stack as well.
An alternative to this is TotalView. This is a similar system; it just has a few different features and supports a lot of the same stuff. It was developed by a different company, but now it's owned and developed by Perforce. It has two different options: a remote client that you can download, as well as a remote connection that you can use; you just module load totalview and run that. You can also get more information from both their docs and the man page.
We also have an upcoming training session, I believe that is tomorrow, yep, for TotalView; if you're interested in more training, click on the link there at the bottom and you can sign up. It has two different interfaces, because apparently people don't like new things. The first one here is a view of their newer interface and what everything looks like; again, similar features to what you would expect from an IDE or a debugger, very similar to what's in DDT.
A
You
see
your
processes
and
kind
of
where
they
are
at
in
the
stack
on
the
left.
You
have
your
Source
listings
in
the
middle.
You
have
some
action
points
and
bookmarks
down
at
the
bottom
left.
You
have
your
loggers,
your
command
line
and
your
data
on
the
bottom,
the
right
side,
you
have
variables
and
their
value-
and
you
have
the
call
stack
in
the
upper
right
here-
is
the
classic
interface
as
they
call
it.
A
It's
a
very
older,
X11
interface,
but
a
lot
of
people
are
very
used
to
this,
and
so
they
like
to
use
it
here.
You
have
again
some
pointers
to
different
features
in
here
where
it
is
based
on
the
GPU
and
the
CPU.
You
have
some
different
Focus
areas.
You have the threading, with assigned positive thread IDs, at the bottom. You have ways to select MPI tasks, to set your breakpoints and your threads, at the bottom. You have the source-code listing again; you can see a value, or dive into it, when you mouse over it in the middle there on the source code. You have a window off on the left that shows the state of the MPI tasks and where everything is, and you have a stack frame and stack trace in the upper middle.
So you can use cuda-gdb just like you would use GDB, except that it also has some CUDA options: you can just type help cuda, and it will get you some more information on what to do there.
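A minimal sketch of what such a session looks like (the application and kernel names are hypothetical; `help cuda` and `info cuda threads` are cuda-gdb's own commands):

```shell
# Start the application under cuda-gdb, break in a kernel, and run.
cuda-gdb ./myapp
(cuda-gdb) break myKernel        # "myKernel" is a hypothetical kernel name
(cuda-gdb) run
(cuda-gdb) help cuda             # list the CUDA-specific command groups
(cuda-gdb) info cuda threads     # inspect device threads at the stop point
```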
There are docs here from NVIDIA that you can look at. It doesn't do other types of programming models; it just handles CUDA right now. It doesn't do anything like MPI or other big models; it may also do OpenMP, I don't remember. Here's another one called compute-sanitizer, which was originally called cuda-memcheck.
This is a drop-in replacement; I believe they're using the type of sanitizer machinery that you would find in either LLVM or Valgrind. Again it's developed by NVIDIA and uses dynamic instrumentation: you basically srun compute-sanitizer, pop in the tool that you want to use, and then your program.
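That pattern looks roughly like this (the application name is a placeholder):

```shell
# Run the memory checker on one task under Slurm;
# swap --tool for racecheck, initcheck, or synccheck as needed.
srun -n 1 compute-sanitizer --tool memcheck ./myapp
```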
A
So
they
have
mem
checker
for
race,
checker,
nationalization
Checker
and
a
sync
checker
I'm,
sure
they're,
going
to
add
more
Checkers
as
they
go
along
based
on
their
tooling
and
based
on
what
llvm
probably
produces
and
there's
some
more
documentation
on
how
to
use
this
tool
here
at
the
bottom
GB
for
HPC.
This
is
another
great
tool.
Here I'm launching a process set named $p of eight tasks for an application called pcm. It starts up a network in the background and connects all of the debug servers to it, and then it sets an initial breakpoint at main of the app. You can see p{0..7} means process set, ranks 0 through 7; because I named it $p, it's using the p there.
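That launch step looks roughly like this (the app name is the one from the demo; syntax as in the gdb4hpc man page, so treat this as a sketch):

```shell
module load gdb4hpc
gdb4hpc
# Inside gdb4hpc: launch 8 ranks of the app as process set $p.
dbg all> launch $p{8} ./pcm
```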
So if I do a listing of where I'm at based on that, you can see I get the first line of function main there, and if you do a viewset of $p, it shows you all of the processes. This lets you work with different kinds of process sets, run multiple apps, and see their different communication.
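Those two commands, as a sketch (command names as spoken in the demo; check gdb4hpc's built-in help for exact syntax):

```shell
dbg all> list          # show source around the current stop point
dbg all> viewset $p    # show which ranks belong to the $p process set
```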
I set a breakpoint here on line 31 of main. Notice that if I try to print out rank, which is a data point in the app, it's currently set at zero, because we haven't quite reached that part of the code yet.
Similarly, there's valgrind4hpc. Again, this uses a bunch of different tools to do things like memory checks, and it does dynamic instrumentation. It doesn't support GPUs at the moment, but it supports other types of programming models like MPI. What it does is run Valgrind against each of your MPI processes and aggregate the data into a more readable report, rather than having, you know, N-tasks number of reports.
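A typical invocation looks roughly like this (flags as in the valgrind4hpc man page; the application name is a placeholder):

```shell
module load valgrind4hpc
# Run memcheck across 8 ranks and aggregate the per-rank reports.
valgrind4hpc -n8 --valgrind-args="--track-origins=yes" ./myapp
```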
So you can try out some of that stuff. Sanitizers4hpc: this is a tool directly from Cray, again using LLVM-type sanitizers, and it's using the same idea: run a sanitizer against each one of the processes
and aggregate the reports, except that these sanitizers use static instrumentation at compile time, rather than what the NVIDIA compute-sanitizer and the Cray Valgrind tools are doing. If you have a very CPU-intensive application, this static instrumentation at compile time can save you some time, because it lowers the overhead due to the way the instrumentation is inserted into the program.
Like I said, they're based on LLVM, and they support GPUs with cuda-memcheck, and they support CCE and GCC. Again, you're just going to want to module swap to the Cray programming environment at this point; they do support GCC, but I prefer to use Cray for this.
You use the option here, -fsanitize. You need to add that to your compile line and make sure that you're sanitizing with the right sanitizer; the sanitizers listed here are for address, leak, and thread. I put in some documentation, both for the Sanitizers4hpc page and for the original sanitizers as they're documented by Google.
So, we have STAT, the Stack Trace Analysis Tool.
This attaches to your processes as they are running and tries to look for deadlocks. What it does is analyze each one of the processes, get a stack trace, or backtrace, from each, and then merge them together so you can see the different places where your application might be in the code.
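Attaching to a running job looks roughly like this (the PID is a placeholder, and the module name may differ on your system; stat-cl is STAT's command-line front end):

```shell
module load stat
# Attach to the running job launcher and gather a merged backtrace;
# replace <pid-of-srun> with the actual PID of your srun process.
stat-cl <pid-of-srun>
# The merged backtrace tree is written as a .dot file (view with stat-view).
```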
ATP lets you do this in a more automated way. Rather than having to start STAT on your own, you can just module load atp, set some variables, and then once the application dies, or you send a termination signal to the application, it will automatically dump some STAT information for you, as well as core files that it selectively chooses. It won't write out all the core files; you can control which ones are written, but it will write a selection based on the backtraces.
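The automated flow is roughly this (variable and output names as in the atp man page; treat it as a sketch):

```shell
module load atp
export ATP_ENABLED=1
# Run normally; on a crash or termination signal, ATP dumps a merged
# backtrace (atpMergedBT.dot) plus a selection of core files.
srun -n 8 ./myapp
```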
There are some options and other things that should be noted. You can set the GDB binary to be whatever GDB version you want in ATP. That can be useful because internally it normally uses STAT to identify the backtraces, but in this case it would use GDB, which sometimes can be a little more useful. There's also a note for Fortran and GNU:
you need to make either a compiler or an environment-variable change to use ATP, because they both use their own backtrace information. And again, you pretty much just srun your program until it terminates or gets a signal, and then you stat-view the .dot files that come out. The files that are output are all in DOT format, so you can also look at them in Graphviz or anything else that supports the DOT format.
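Viewing the output looks like this (the file name shown is ATP's default merged-backtrace output; the module name may differ on your system):

```shell
module load stat
# Open ATP's merged backtrace tree in STAT's viewer.
stat-view atpMergedBT.dot
```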
This gives you an idea of what it looks like. You can see right away that something took a fault, merged into a summary, and you can see from the side here that it's ranks three through seven of the eight.