From YouTube: 5. Profiling and Debugging
Description
Part of An Introduction to Programming with SYCL on Perlmutter and Beyond, March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
Okay, so, learning objectives: learn some tips to help debug SYCL, learn how to profile SYCL code for the CUDA backend, learn about coalesced global memory access, and learn some optimization tips.
So, in SYCL, errors are handled by throwing exceptions. It is crucial that these errors are handled; otherwise your application could fail in unpredictable ways.
In SYCL there are two kinds of error: synchronous errors and asynchronous errors. Asynchronous errors will typically only materialize when you call wait on a queue or an event.
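As a minimal sketch of handling both kinds (the handler lambda and the empty kernel are illustrative, not from the slides):

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
      // Asynchronous errors are collected and delivered to this handler
      // when the queue (or an event) is waited on, not at the failure point.
      auto on_async_error = [](sycl::exception_list errors) {
        for (const std::exception_ptr& e : errors) {
          try { std::rethrow_exception(e); }
          catch (const sycl::exception& ex) {
            std::cerr << "Async SYCL exception: " << ex.what() << '\n';
          }
        }
      };

      sycl::queue q{on_async_error};

      try {
        q.single_task([]() { /* kernel work */ });
        q.wait_and_throw();  // asynchronous errors surface here
      } catch (const sycl::exception& ex) {
        // Synchronous errors are thrown directly at the call site.
        std::cerr << "Sync SYCL exception: " << ex.what() << '\n';
      }
    }

If you never wait (or the queue has no handler), asynchronous errors can go unnoticed until the program misbehaves, which is exactly the unpredictable failure mentioned above.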
So, if you're using default-constructed queues, you can use SYCL_DEVICE_FILTER to run the code on the host. This is a really important debugging strategy. If it's not working on the host, it's definitely not going to work on the device; but if it is working on the host, it might still not work on the device.
But you need to tick this box first, to make sure that you don't have any problems on the host. You can also set the queue's thread-pool size to one, which makes execution for the queue completely serial; that's not necessarily the default, but it's worth knowing.
So normal tools, normal C++ tools like GDB and Valgrind, can be used with normal SYCL code, and in-kernel printfs are a great way to debug code on the device.
Once everything is working on the host, and you've run GDB and Valgrind and everything's fine, then if things still aren't working on the device, printfs are the way to go.
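As a sketch, portable in-kernel output can be done with sycl::stream (DPC++ also offers an experimental printf extension); this assumes a queue q is already in scope:

    q.submit([&](sycl::handler& cgh) {
      // Total buffer size and per-statement size for the stream.
      sycl::stream out(8192, 256, cgh);
      cgh.parallel_for(sycl::range<1>(16), [=](sycl::id<1> i) {
        out << "work-item " << i[0] << '\n';  // flushed when the kernel completes
      });
    }).wait();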
So here we have SYCL_DEVICE_FILTER set to host, and then a single queue thread, the thread-pool size equal to one, and then we just run GDB; and we can do the same with Valgrind.
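The invocations would look something like this (the environment variable names follow the DPC++ documentation of the time; ./myapp is a placeholder binary):

    SYCL_DEVICE_FILTER=host SYCL_QUEUE_THREAD_POOL_SIZE=1 gdb ./myapp
    SYCL_DEVICE_FILTER=host SYCL_QUEUE_THREAD_POOL_SIZE=1 valgrind ./myapp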
That kind of thing. Okay, so yeah, these are very basic debugging strategies, but very useful. Okay: optimization.
So one of the most important things, one of the things that will dictate performance significantly, is how you access global memory. Memory access patterns can significantly affect performance, and this is especially important when reading or writing global memory. That's the main thing here: global memory. If you're accessing, say, local memory (shared memory, as it's called in CUDA), it's not as important, but for global memory it certainly is.
Essentially, you want your work items to be accessing adjacent bits of memory: you want this work item to access the element next to the one that work item accesses, and so on. Because essentially what the memory manager is going to do is load a particular line of this memory, and it's not going to discriminate. Say I only wanted this element, that one and that one: it's not going to fetch just those.
There's a default cache-line size, and every time we ask for memory, the hardware will fetch that amount, so we want to make sure that we're using as much of it as possible. So this case is really good: coalesced global memory access.
Yes, that's 100% global access utilization. This case, by contrast, is quite bad: if we're accessing every second element of memory, then half of each load has gone to waste, and we see that these work items have nothing to do. In the worst case, some work items might actually be waiting on others to get their data.
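A sketch of the two patterns, assuming a queue q, USM pointers in and out, and a size n are in scope (all illustrative names):

    // Coalesced: consecutive work-items read consecutive elements,
    // so every byte of each fetched cache line is used.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
      out[i] = 2.0f * in[i];
    });

    // Strided: consecutive work-items are two elements apart, so half
    // of every cache line fetched from global memory is wasted.
    q.parallel_for(sycl::range<1>(n / 2), [=](sycl::id<1> i) {
      out[2 * i] = 2.0f * in[2 * i];
    });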
Okay, a word of caution here: index flipping. SYCL ranges are row major. That essentially means that, for some two- or three-dimensional range, the work item with SYCL id (i, j) is a neighbor of (i, j+1), as it is in most C and C++ APIs.
In CUDA, by contrast, threads (i, j) and (i+1, j) are neighbors: CUDA organizes its threads in a column-major format. So we need to make sure we're aware of this, and a good rule of thumb is not to calculate a linear index manually; it's better to use the member functions get_local_linear_id and get_global_linear_id.
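A sketch of that rule of thumb, assuming a queue q, pointers in and out, and dimensions rows and cols divisible by 16 (all illustrative):

    q.parallel_for(sycl::nd_range<2>{{rows, cols}, {16, 16}},
                   [=](sycl::nd_item<2> item) {
      // Risky: a hand-rolled index bakes in a layout assumption, e.g.
      //   size_t idx = item.get_global_id(0) * cols + item.get_global_id(1);

      // Preferred: let SYCL produce the layout-correct linear index.
      size_t idx = item.get_global_linear_id();
      out[idx] = in[idx];
    });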
So what happens if we do it manually and we use row-major data with column-major memory access? We're still using all of the elements, but essentially we're treating threads as adjacent that are actually organized in a column-major way.
We're kind of messing up our memory access patterns. On the A100, potentially, this isn't actually a problem: maybe the latest NVIDIA hardware is able to cope with this, as with the previous example where things were just flipped the other way around. But I would say you'd still suffer some performance loss with this particular access pattern.
By contrast, if you know that you're using row-major data and also a row-major work-item layout, then you can access these very nicely, and this is guaranteed to be optimal. The other layout might also happen to be optimal, but this one is guaranteed to be the best memory access pattern you can get. So again, to avoid all this, do not calculate the linear id yourself: use the member functions.
Okay, a few very quick points; obviously you could talk for a long time about optimization strategies, but very quickly. Different problems are optimal for different work-group sizes, so you should test them, benchmark, see which is the best and stick with that. Minimize memory transfers; this kind of goes without saying. Memory transfers take time, going from host to device, device to host, and so on. In general, prefer malloc_device over malloc_shared.
That's true if there isn't any physically shared memory. If there is physically shared memory, then malloc_shared will be, you know, amazing. For instance, if you're running on an Intel CPU with unified graphics, then you have physically shared memory, and if you're using malloc_shared there, it's really great. But in general, say in CUDA:
shared allocation is done with cudaMallocManaged, which relies on page faults to move values here and there, and that's slower than explicitly moving data around.
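A sketch of the explicit-movement pattern with malloc_device (the size, names and kernel are illustrative):

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      sycl::queue q;
      const size_t n = 1024;
      std::vector<float> host(n, 1.0f);

      float* d = sycl::malloc_device<float>(n, q);          // device-resident memory
      q.memcpy(d, host.data(), n * sizeof(float)).wait();   // explicit copy to device
      q.parallel_for(sycl::range<1>(n),
                     [=](sycl::id<1> i) { d[i] += 1.0f; }).wait();
      q.memcpy(host.data(), d, n * sizeof(float)).wait();   // explicit copy back
      sycl::free(d, q);
    }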
So that's a very, very easy optimization. Inline functions: if you're calling a function from within a kernel, then inline it and see if you get a performance gain. Recently on our team, Teddy was running a benchmark and he managed to get a 30% speed-up just by inlining a list of functions, so it can really give a good performance gain. Use local memory where possible: if the algorithm lends itself to local memory, then definitely use it as opposed to global memory; see the sketch below.
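A sketch of staging data through local memory, assuming a queue q, a pointer in, and a size n divisible by 256 are in scope (local_accessor is the SYCL 2020 spelling; older toolchains used an accessor with target::local):

    q.submit([&](sycl::handler& cgh) {
      // One tile of scratch space per work-group, in fast on-chip memory.
      sycl::local_accessor<float, 1> tile(sycl::range<1>(256), cgh);
      cgh.parallel_for(sycl::nd_range<1>{n, 256}, [=](sycl::nd_item<1> it) {
        size_t l = it.get_local_id(0);
        tile[l] = in[it.get_global_id(0)];   // stage once from global memory
        sycl::group_barrier(it.get_group());
        // ... operate on tile[] instead of re-reading global memory ...
      });
    });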
Keep work groups converged where possible. Make sure that there isn't too much divergence in the control flow within a particular work group. Newer hardware does better with this, and things like independent forward progress of work items are possible on newer CUDA devices, but a rule of thumb is to try to keep your work groups as aligned as possible in their execution.
Also a very easy thing: use the sycl::native namespace, for example sycl::native::sin, if the native accuracy is tolerable. Once you use the sycl::native functions, you're just relying on the precision of the native hardware functions, and if that's okay, then great. The functions in the plain sycl namespace have certain precision guarantees, but perhaps they're not necessary; the native functions usually have less precision, but not always.
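For example (x is a placeholder value):

    float a = sycl::sin(x);          // precision guaranteed by the SYCL spec
    float b = sycl::native::sin(x);  // faster; precision is implementation-defined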
Okay, so: profiling. The standard NVIDIA tools are still available when you're profiling.
Kernel names have to be unique: you can't have this parallel_for using one name and then that parallel_for using the same name, or that single_task having the same name. Naming kernels can be useful because the output that you get from a profiler is usually verbose and has a lot of words in it, and if you can spot "aha, my reduce kernel", it makes things easier.
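A sketch of naming a kernel, assuming this sits inside a q.submit with command-group handler cgh (the class name is arbitrary):

    class reduce_kernel;  // declaration only; used purely as a kernel name

    cgh.parallel_for<reduce_kernel>(sycl::range<1>(n), [=](sycl::id<1> i) {
      // ... kernel body ...
    });

The name then shows up (mangled, but recognizable) in profiler output.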
Okay, so nsys, again part of the NVIDIA toolkit's profiling tools. This can be used for tracing and also for timings: a very simple tracing and timing output, similar to nvprof if you've used that before. (On the slide the space is missing; the command should be "nsys profile", with a space after "nsys".) And then you'll get this kind of output.
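An invocation looks something like this (the --stats flag prints the summary tables; ./myapp is a placeholder binary):

    nsys profile --stats=true ./myapp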
This gives you good kernel statistics. The kernel submission will give you a somewhat long-winded name, but that's okay, and it'll give you the timings, including the timings of memcpys and so on. So we can see that the memcpys actually take a pretty significant share of our total operations here, which is sometimes the case, though not always.
ncu can also be used for detailed kernel analysis. If you want to measure things like occupancy, you can just use it in the usual way: run your ncu command like this and you get all this output, with things like block size and grid size.
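Something along these lines (./myapp is a placeholder binary; by default ncu reports sections that include occupancy):

    ncu ./myapp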
There should be things about theoretical occupancy and achieved occupancy. This one is not very good: an achieved occupancy of 6.11 against a theoretical 50, which is not that good. But this is, I think, a bad example that I tried to...