From YouTube: Profiling/debugging for GPUs
Description
Jonathan Madsen of LBNL presents a talk on Profiling/debugging for GPUs. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Yan Zhang.
I'm going to talk about some of the profiling and debugging tools. No doubt over the course of yesterday and some of today you've had some of these referenced, or maybe they were presented in some of the tutorials yesterday, so this will kind of be a review in certain places.
Let me turn on captions, okay. So just a quick little overview: I'm going to talk about some of the nuances and the different tools available for both debugging and profiling.
Debugging on the GPU is a little bit different than on the CPU, because it's asynchronous with the CPU, so sometimes where you detect the error isn't exactly where the error occurred, or, say, the kernel was launched in a different place. And if you are used to working with highly parallel codes, you know that debugging highly parallel codes, like what you have on the GPU, tends to produce heisenbugs, where the simple act of trying to study the bug makes it disappear.
This is quite a difficult problem, but there are tools available to sort of solve these things when debugging. I think this was covered elsewhere: for nvcc, the lowercase -g flag generates debug information for the host code.
I don't believe that nvcc supports this, and I'm not sure if clang does, but hopefully that is something that they will eventually address, because these features are very nice. The -fsanitize options for address and data-race checking are very nice for just compiling your code, running it, and then getting a report at the end. Most debugging error tools... oh, sorry, most CUDA routines return a cudaError_t, and you want to check whether or not that equals cudaSuccess.
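
To make that concrete, here is a minimal sketch of such a check; the CUDA_CHECK macro name and the exit-on-error policy are my own choices for illustration, not something from the talk:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: wrap every CUDA runtime call and verify that the
// returned cudaError_t equals cudaSuccess, reporting where it failed.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

int main() {
    float* d_buf = nullptr;
    CUDA_CHECK(cudaMalloc(&d_buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```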
You will quickly become a fan of using it in the IDE; it's so much nicer to use. But sometimes this requires actual integration with your build system, and that's one of the big benefits of using CMake. Then cuda-gdb is the NVIDIA-supported debugger. It's pretty much modeled after gdb, the GNU debugger; it's built as an extension.
A
You
simply
just
run
cuda
gdb
and
then
the
command
options.
And
then
you
will
get
an
interactive
prompt.
You,
you
type
run
enter
whenever
you
get
the
the
error
you
can
hit
back
trace
and
it
will
show
you
the
call
path
to
where
you're
getting
it.
And
then
you
can
print
variables,
switch
between
frames
and
stuff
like
that,
then
there
is
the
cuda
memcheck
tool,
which
is
a
functional
correctness,
checking
tool.
This
is
sort
of
similar
to
their
sanitizer
stuff,
the
sanitizer
stuff
that
I
mentioned
previously.
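
As an illustration of that workflow, a session might look something like this (the program name and the particular error are invented for the example; run, backtrace, print, and frame are the standard gdb-style commands):

```
$ cuda-gdb ./my_app                      # "my_app" is a made-up program name
(cuda-gdb) run
...
CUDA Exception: Warp Illegal Address     # example of an error it might stop on
(cuda-gdb) backtrace                     # show the call path to the failure
(cuda-gdb) print some_variable           # inspect a variable in the current frame
(cuda-gdb) frame 2                       # switch to another stack frame
```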
Sometimes, though, printf really will just be the best debugger, and I think all of us have used printf as a debugger quite a bit.
Usually printf is also great for enabling just sort of log messaging, and this is always nice to have in a code that you are distributing, so that if a user comes back to you with an error, you can simply say: set an environment variable, hey, turn on VERBOSE=3 or something like that, and you'll see values printed out in the code that help provide you context about where the errors are actually occurring. And the same thing with that macro: you should have sort of always-on error checking.
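
Here is a minimal sketch of that environment-variable pattern; the VERBOSE variable name and the level semantics are assumptions made for illustration:

```cpp
#include <cstdio>
#include <cstdlib>

// Read a verbosity level once from the environment, e.g.
//   VERBOSE=3 ./my_app
// ("VERBOSE" is just an example variable name).
static int verbosity() {
    static const int level = [] {
        const char* v = std::getenv("VERBOSE");
        return v ? std::atoi(v) : 0;
    }();
    return level;
}

// Print the message only when the user asked for at least this level.
#define LOG(lvl, ...)                          \
    do {                                       \
        if (verbosity() >= (lvl))              \
            std::fprintf(stderr, __VA_ARGS__); \
    } while (0)

int main() {
    LOG(1, "starting run\n");
    LOG(3, "detailed state: x=%d\n", 42);
    return 0;
}
```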
So let me move on now to profiling.
Just like debugging, profiling has some nuances. Measuring the performance can degrade your performance. Unfortunately, hardware counters in particular, the way they are implemented with nvprof, that particular API sort of serializes the kernel execution, so you don't get any overlap. And also, kernels are... sorry, hardware counters are a finite resource, so if you request measuring too many hardware counters, CUDA actually has to replay the kernel in order to collect all the hardware counters.
And also, when it comes to profiling, the performance on the CPU still matters. Simply optimizing the actual kernel execution isn't really as important as optimizing sort of the host-to-device communication patterns or your memory access patterns.
So it gives you a per-kernel breakdown, and you can identify computational bottlenecks with your memory access and occupancy and stuff like that. Once you have optimized the code, especially at a hackathon, if you spent, you know, a week at a hackathon optimizing the code using all these tools, you have things migrated to the GPU and they're running well.
Then you can easily just sort of look back and say: has this region expanded significantly or reduced in its run time compared to this old run, and really do sort of continuous monitoring. The GUIs are very nice to use, but they really don't get run all that often, and integrating sort of a continuous monitoring of performance that you can refer back to, or easily run without a GUI, is highly advantageous.
A lot of us have simple CPU timers in our CPU code, and you want to try and integrate something like this into your code. On the CPU there are compiler-based tools that sort of make it easy to do profiling; they've sort of built profilers into the compiler, like XRay, which is part of Clang, and then they also have...
ways... anyway, I'm sorry, I'm just going to move on. The flags above might work with clang, but I haven't actually tested that, and to my knowledge, nvcc does not have a compiler-based way to instrument your code.
And, as I mentioned, as far as GUIs go, there are Nsight Systems, Nsight Compute, and nvprof. There are also several open source tools: AMD has a profiler, and then there's TAU, Score-P, and HPC Toolkit. They're all open source projects that have been around for a long time. They have GUIs, visualization, stuff like that, a lot of features, because there's really not anything particularly special about Nsight Systems or Compute, in so much as they're not doing something that an open source tool cannot, because NVIDIA provides APIs for tool developers.
So nvprof uses the CUPTI callback API, which has been around for a while and has widespread support in open source tools. Nsight Systems is mostly timing sort of stuff, tracing, and it has widespread support. But the new Nsight Compute uses a new API, and in the open source tools
there's minimal support at this point, even though there will soon be quite a bit more. As far as building something into your software, the CUDA runtime API has some basic utilities that you can use. For example, there's cudaEventElapsedTime for getting the timing between two events; that just sort of sticks a timestamp into the stream that's being processed. And then there are ways to control an external profiler from your code, so you can start and stop it.
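
A minimal sketch of those two runtime features, using a placeholder kernel: cudaEventElapsedTime for event-to-event timing, and cudaProfilerStart/cudaProfilerStop (from cuda_profiler_api.h) for controlling an attached external profiler:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

__global__ void work(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    float* d_x;
    cudaMalloc(&d_x, 256 * sizeof(float));

    // Events record timestamps in the stream; elapsed time is measured
    // between the two recorded points.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaProfilerStart();           // tell an attached profiler to begin collecting
    cudaEventRecord(start);
    work<<<1, 256>>>(d_x);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until the stop event has completed
    cudaProfilerStop();            // tell the profiler to stop collecting

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```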
The cuda-memcheck that I mentioned earlier uses the sanitizer API, so you can implement sort of a cuda-memcheck within your own code. I mentioned NVTX and the decoration; you can include that as a header-only source by including that file, you see that right there. And if you use wall-clock timers on the CPU, just remember those wall-clock timers are kind of meaningless unless you do a sync before them.
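
A short sketch combining both points, assuming the header-only NVTX that ships with the CUDA toolkit (the kernel is a placeholder; note the cudaDeviceSynchronize before reading the CPU timer):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>   // header-only NVTX v3; older toolkits use <nvToolsExt.h>

__global__ void work(float* x) { x[threadIdx.x] += 1.0f; }

int main() {
    float* d_x;
    cudaMalloc(&d_x, 256 * sizeof(float));

    nvtxRangePushA("work region");      // shows up as a named range in the profiler
    auto t0 = std::chrono::steady_clock::now();

    work<<<1, 256>>>(d_x);
    cudaDeviceSynchronize();            // without this, the CPU timer measures only
                                        // the asynchronous launch, not the kernel
    auto t1 = std::chrono::steady_clock::now();
    nvtxRangePop();

    std::printf("wall time: %f s\n",
                std::chrono::duration<double>(t1 - t0).count());
    cudaFree(d_x);
    return 0;
}
```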
We call the project timemory, and it works by sort of creating single handles that you can use to invoke multiple of these APIs. So you can combine the profiler start and stop with an NVTX marker, or you can have direct access to the CUPTI tracing API or the hardware counters. It's available for C, C++, Fortran, and Python.
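
As a rough sketch of that handle idea, based on timemory's component-bundle design (treat the exact header, namespace, and component names as assumptions to verify against the timemory documentation):

```cpp
#include <timemory/timemory.hpp>

// A single handle type that bundles several measurement APIs at once:
// here a wall-clock timer plus an NVTX marker.
using bundle_t = tim::component_tuple<tim::component::wall_clock,
                                      tim::component::nvtx_marker>;

void compute() {
    bundle_t handle("compute");  // one handle drives all bundled components
    handle.start();
    // ... launch kernels, do work ...
    handle.stop();               // stops the timer and closes the NVTX range
}
```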
A
You
can
get
direct
access
to
the
data
for
your
continuous
monitoring,
one
minute
to
wrap
up
one
minute
to
wrap
up
guys
sure,
excellent,
okay
and
sort
of
the
the
key
features
of
this
is
it's
really
easy
to
create
new
components
because
they
can
be
composed
of
other
components,
and
if
you
integrate
it,
if
you
create
a
pull
request
and
get
it
integrated
into
sort
of
the
native
stuff,
it
becomes.
This
standalone
python
class
ii
that
can
be
used
from
python
and
c
plus
plus
users
can
create
their
own
components
locally.
If I don't see that in the chat... oh yes, okay, and that will just stay for maybe five or ten minutes. Sure. Yeah, the ERT (Empirical Roofline Tool) that I built in is available. It's implemented in headers and actually has an extension, so that, you know, the traditional ERT is sort of based on doing FMA operations to estimate the peak.
This actually allows you to replace those via lambdas, so that you can estimate the peak of, say, just a vectorized multiplication operation or a scalar operation, and sort of model it after the peak for what you think your code actually might look like, because not all of us have codes that can execute a whole lot of FMA operations.
I'd say: cool, thanks. I'm glad to see more profiling and debugging tools at NERSC and in the general HPC community. Thank you, Jonathan.