From YouTube: 6. Nvidia Nsight Developer Tools -- Max Katz
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
Jeff Larkin already showed this slide, so I won't go through it in too much more detail, except to repeat that the NVIDIA developer tools suite has a focus on both code correctness tools and code development tools like IDEs. The other half of the picture is profiling tools: Nsight Systems, Nsight Compute, and a related tool suite.
We call the Nsight family of profiling tools the profiling side of the developer tools family, and there are really two tools that are relevant in the context of HPC. Nsight Systems is always going to be your starting point. Nsight Systems is a high-level application profiling tool: it collects information about an application as it runs, and it also collects information about the state of the system as it runs, so you can understand what was happening in your application as a function of time.
Where was I spending time transferring data between the CPU and the GPU? And where are the kernels that I'm launching? (Brent mentioned that a kernel is the atomic unit of work that gets launched on a GPU.) Where am I launching kernels that don't have enough work to be able to use the GPU effectively? Brent mentioned that modern GPUs have thousands of cores and that you really want tens of thousands, or maybe even hundreds of thousands... oh, my slide must have a slide transition.
GPUs can have tens of thousands or hundreds of thousands of threads, and you want to have that many launched at one time in order to use the GPU effectively, so Nsight Systems will help you answer: am I launching kernels that have that? If you find, using Nsight Systems, that you're launching kernels that only use a thousand threads, there's a very good chance that you're not using the GPU as effectively as you could. Now, once you have identified that a specific CUDA kernel is the bottleneck in your application,
then you would switch to Nsight Compute to analyze that kernel. Now, we will talk more about CUDA C++ tomorrow, as the kind of low-level programming model for NVIDIA GPUs that we've been referring to. The one thing that I want to emphasize, which is a nice feature of the NVIDIA platform, is that all programming models that run on NVIDIA GPUs generate the same kind of underlying code, and that means you can analyze them all equally, without any sort of favoritism towards any particular model.
The way that you analyze the code, of course, will be a little bit different, because it's different source code, and for some models we may have additional support for bonus profiling functionality, depending on which programming model you use. But at the core they all generate the same kind of code, and so they can all be analyzed in the same way.
So I've used the phrase "CUDA kernel" here to emphasize the fact that, in some sense, every programming model generates CUDA kernels, whether or not you're explicitly using CUDA. But you don't necessarily have to know that when you first start profiling on the NVIDIA platform.
Okay. So once a particular kernel has been identified as being the bottleneck in your application, Nsight Compute allows you to analyze the performance of that kernel in detail. I've also shown you Nsight Graphics here, which is the corresponding tool to Nsight Compute for graphics performance analysis; presumably that's not relevant to most, or probably all, of you. Now, talking about Nsight Systems in a little bit more detail: it is our system-level slash application-level profiler.
It allows you to identify gaps in the timeline where nothing is happening on the CPU or the GPU, and then you can use that information to go back to your source code and think: why is it that I'm not doing anything on the GPU at this time? Maybe you unintentionally left some part of your code on the CPU and you want to move it to the GPU. So Nsight Systems is useful for showing you, at a given time,
in wall time relative to the beginning of the application run, where am I on the CPU and the GPU? Nsight Systems can profile one or many CPU cores and GPUs. So if you're running on multiple GPUs on the same server, for example, if you have a multi-GPU server, you could profile an application that uses multiple GPUs and see them all on the same timeline.
Nsight Systems is supported on multiple architectures: we have support on Linux, Windows, and Mac. Generally speaking, we support collecting the profiles themselves on both Linux and Windows, because those are the places where you're likely to be running NVIDIA GPUs. And because of the way Nsight Systems is architected, you can collect the profile remotely on a remote cluster, without the use of a GUI, and then bring that stored,
saved report back to your local system and then analyze it in the user interface there, and you can do that cross-platform. (Sorry, I didn't realize the slide transitions over here.) So, for example, you can collect a report on Linux and then display the collected report in the user interface on either Windows or Mac, and that's the workflow that I'll show you in a second.
The timeline has rows, and each row shows you information about what's happening for that section of analysis as a function of time. Time increases from left to right here, and each row traces or collects a different kind of data. So in this example, information about what's happening on the CPU cores and threads is in the upper part of the screen. You can see a bunch of different Python processes being launched; actually, these are Python threads.
You might be calling any number of other APIs that Nsight Systems knows how to collect information about. It can then show you, in this example in the bottom half of the screen, the GPU-centric view: as a function of time, where are all the kernels that I launched on the GPU, as well as where are all the memory copy operations that are happening on the GPU? And in this example we're showing you a multi-GPU run, so for cases where you're using multiple GPUs in the same workload, you can see them all at once.
You can then zoom in to particular areas of interest. This screenshot showed a relatively large chunk of time, in the hundreds of milliseconds. Often you'll want to zoom in to a very particular slice of time and understand in more detail what's happening there, so you can just zoom in in the profiler. This is an example zoomed into a much smaller chunk of time, a small fraction of one millisecond, and then you can get information on individual operations.
The name of the command-line interface for Nsight Systems is nsys, and nsys has a number of different modes that it can run in, but profiling is the most common one that you'll use. So it's nsys profile and then the name of your application. As you'd expect, it has a number of runtime options; in this example I'm showing you two of them. -o sets the name of the report
file that's being generated, which in the newest version of Nsight Systems has the .nsys-rep file extension, but in older versions, like the one we'll be using today, has the .qdrep file extension. And --stats=true means that I want to generate a summary at the end of the run of all of the GPU- and CPU-related activities.
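Putting those pieces together, an invocation would look something like nsys profile -o myreport --stats=true ./myapp, where the report name and the application name here are just placeholders, not names from the demo.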
Typically, you'll want to do a little bit more work here, because you'll want to give each report file a separate name for each rank, and nsys does understand how to inject environment variables into the report name. So you could say something like: I want each report file to have a separate name corresponding to the MPI rank in question, and you just use the MPI environment variables for your implementation to do that, or you can use SLURM_PROCID if you're using Slurm.
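For instance, assuming nsys's %q{VAR} substitution syntax for pulling an environment variable into the output name, something like nsys profile -o report_rank%q{SLURM_PROCID} ./myapp would give each rank its own report file; again, the report and application names are placeholders.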
It is also possible to put nsys profile before mpirun, and in that case it will profile everything that's launched by the job launcher. But because at this time Nsight Systems is not multi-node aware, it will only profile ranks that are launched on the same node as you. So using nsys profile and then mpirun is primarily only useful for single-rank or single-node cases where you're launching from the same node that you're running on, which typically isn't the case on a cluster.
Now, this is an example of an Nsight Systems profile that I collected, I think on Perlmutter a few weeks ago, and you can see what the Nsight Systems user interface looks like. This is basically what the report looks like when you first open it: you'll see a timeline view in the upper half of your screen and then an events view in the lower half of your screen.
So, for example, if I click a particular row and then click "Show in Events View", it will create a table of operations that I can then see down here. And in the case of NVTX, which is the instrumentation API that you can use for annotating your code with human-readable strings, you can get a nested view of what's happening in your application. So this is an application that launches a bunch of time steps, and then I can filter down through individual time steps and see what's happening in each one.
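As a rough sketch of what that kind of NVTX annotation can look like in CUDA C++ (the function, loop, and range names here are hypothetical, not the code from the demo):

    #include <nvToolsExt.h>   // NVTX header; link with -lnvToolsExt

    // Each iteration shows up as a named, nestable range on the
    // Nsight Systems timeline and in the events view.
    void run(int nsteps) {
        for (int step = 0; step < nsteps; ++step) {
            nvtxRangePushA("timestep");   // open a human-readable range
            // ... launch kernels / copy data for this step ...
            nvtxRangePop();               // close the range
        }
    }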
In this example here, what's happening on the CPU side is in the bottom half of the screen, and what's happening on the GPU is in the upper half of the screen. Any row can be expanded by clicking the arrow here, and now that I've expanded the CUDA row, I can see a whole bunch of operations that are happening on the GPU, and I can zoom in so you can see them better.
Generally speaking, in the overall CUDA row, which is this uppermost row, kernels (compute operations) are blue, and memory copy operations, transferring data between the CPU and GPU, are green and red. And then, if I scroll in far enough, I can find any particular kernel operation. So, for example, here's a kernel: an individual, discrete piece of work. This example is using CUDA, but you'd see the same kind of presentation regardless of which programming model you used.
You'd see basic information about the kernel, which tells me things like how many threads in total I launched in order to run this kernel, and that can give you a sense of whether I'm launching enough work to even fill up a GPU. So Nsight Systems is super useful for understanding what's going on as a function of time. You can see that this chunk of the timeline here has zero activity happening on the GPU; that's normally a bad thing.
In this particular example, I had turned on MPI tracing, so this allows you to understand where MPI calls are happening in your application, and you can see that this chunk of the timeline on this rank happened to correspond nicely to an MPI all-reduce operation. So that explains why there's no GPU activity happening here: I'm waiting on an MPI operation to finish. You can also see calls into the CUDA runtime API, and these can be useful for understanding what your code is asking the CUDA runtime to do.
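(For MPI tracing, the collection would have been launched with nsys's trace selection, e.g. something like nsys profile --trace=cuda,nvtx,mpi -o report ./myapp; the exact trace list and names here are an assumption, not taken from the slides.)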
Okay, switching gears now to, and you can of course still ask me questions, switching gears now to Nsight Compute. Nsight Compute is our kernel profiling tool, and it allows you to get detailed performance analysis of individual kernels. In the profile that I was just showing you, you could get some information about a kernel, like how many threads it launched and how long it ran.
Nsight Compute works by doing different kinds, or levels, of analysis, each of them asking a different question about the performance of a given kernel. As one example, the one that you start with, the one that's at the top of the Nsight Compute report when you open it, is the GPU Speed of Light section. This gives you a high-level analysis of how much of the compute throughput that was available on the GPU I used, and how much of the memory bandwidth I
used. This is presented in a bar chart where SM percentage is a measure of the compute throughput and memory percentage is the percentage of memory bandwidth that I used. Generally speaking, if you are using at least 60 or 70 percent of one of those two subsystems on the GPU, that's a good sign that you're using the GPU effectively; if you're using less than that, then that could be a sign that you aren't using the GPU to its full potential.
Brent mentioned that the latency of individual operations is relatively high on GPUs, and there is really no way around that; it's a fundamental characteristic of how GPUs are designed. The only way to compensate for the fact that the latency of individual operations on GPUs is high is to have a lot of those operations going all at once, so that any individual operation can have its latency hidden by operations in other threads.
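To put a rough number on that (these figures are illustrative, not from the talk): by Little's law, the concurrency needed is roughly latency times throughput, so if a memory load takes on the order of 400 cycles and you want to sustain one load per cycle, you need on the order of 400 loads in flight at once, and the way you get them is from other threads.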
If you aren't able to achieve a high fraction of the peak throughput of the GPU, that's usually a sign that you don't have enough threads in flight to hide latency. And Brent mentioned that threads in flight can really be thought of as a measure, in many cases at least, of how many independent items of work you have to do, or how many degrees of freedom you have.
So if you think about a for loop from i equals one to n, in Fortran or in C, then the trip count of that for loop, what n is, will essentially, in almost all cases, be a measure of how many threads there are. If n is a thousand, then I launch a thousand threads.
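As a minimal CUDA sketch of that mapping (the kernel, its name, and the sizes are hypothetical): each loop iteration becomes one GPU thread, so the loop trip count n is the total thread count.

    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per loop iteration
        if (i < n) x[i] *= a;                           // body of the original for loop
    }

    void launch(float *x, float a, int n) {
        int threads = 256;                          // threads per block
        int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
        // n = 1000 launches only ~1000 threads, likely too few to fill a GPU;
        // n in the millions gives the tens of thousands of threads needed.
        scale<<<blocks, threads>>>(x, a, n);
    }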
Nsight Compute also shows you movement between the main memory and the caches, and it helps you understand, for people who are willing to do detailed analysis, what's going on in your kernel. Again, for those of you who are just getting started on GPUs, you wouldn't start here; this is a relatively advanced level of analysis. But for those of you who really want to dig into a particular kernel, you can then use Nsight Compute for this.
Nsight Compute also allows you to look at the source code of your kernel, either the assembly code or the original high-level C or Fortran code, and then correlate that with some number of metrics that we have sampled through the kernel. What we can do is, every some number of clock cycles, record a sample, and that sample will record where we are in the code, as well as some information about why we are waiting, if we are waiting at that point in the code.
Why are we waiting there? This allows you to say, roughly speaking, where am I spending time in my code, and why am I spending time there? Is it because I'm waiting for some data to arrive before I can compute this operation? Is it because the latency of this particular arithmetic operation is high? That sort of thing. This level of analysis is quite tricky to do, for two reasons.
One is that understanding the actual data that we're presenting to you does require some understanding of the GPU architecture, and that requires practice and experience. The other is that GPUs are, by their nature, very highly parallel, and so you should think of these samples as averages across all the threads that are being launched on the GPU. It's not the work of any one particular thread, and that requires you to think in a fundamentally parallel, or averaged, way, which is non-intuitive for somebody who thinks primarily in terms of serial CPU threading.
The overhead that this adds to your kernel depends on how much data you're collecting, but it can be 10x, or sometimes even 50x or 100x, so, of course, that can make the runtime of your application 10 or 100 times longer. So if you profile every single kernel in your application, that will introduce a relatively large amount of overhead. I recommend not doing that, but instead profiling only the specific kernels that you care about in any given analysis iteration. So, for example, -k says I only want to profile kernels with a specific name.
You may even narrow it down further than that, for example by only profiling some number of launches of that kernel, rather than every single one. So you will want to be careful about how much data you collect with Nsight Compute, but this is the workflow that you would use.
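(As a hedged illustration of that filtering, with the kernel and application names as placeholders: an Nsight Compute CLI invocation along the lines of ncu -k mykernel -c 5 -o kernelreport ./myapp would profile only five launches of kernels whose names match "mykernel".)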
And then just a quick look at how this looks in the viewer. If you can't see the Nsight Compute user interface, let me know.
This is an example of what I get when I look at Nsight Compute. As I mentioned, it has these different kinds of sections of analysis. The first one is this GPU Speed of Light, and then, if I expand any given section with that arrow, it gives me a lot more detail about that analysis. So, for example, here's that bar chart that I was showing you, with both compute throughput and memory throughput, for this particular kernel.
You can see memory usage was 33 percent of peak throughput to begin with, and compute throughput was even less: it was 17 percent. So this is an example of a kernel that isn't using the GPU as effectively as it's possible to use the GPU, and potentially, if you could improve it, you would, although sometimes that's not easy, depending on the way that you've written the code. We can also generate a roofline chart, which allows you to see
your flop rate as a function of arithmetic intensity, for those of you who have used roofline analysis in other tools, and then there are a number of sections down below for different kinds of analysis. There are multiple pages in Nsight Compute: if you click from the Details page to the Source page, you would see this source code view, and then you could scroll down through your source code and see, as a function of the line of code, how many instructions were executed.
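(For reference, this is the standard roofline model rather than anything specific to the slides: with arithmetic intensity I = flops executed / bytes moved, attainable flop/s = min(peak flop/s, I x peak bytes/s), so kernels with low arithmetic intensity are capped by memory bandwidth rather than by compute.)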
In particular, we did a joint training with Oak Ridge almost two years ago now (it's crazy how fast the time goes), where we gave a much more detailed, hour-long presentation on each of Nsight Compute and Nsight Systems. So if you want to learn more about that, you can go consult those training sessions; most of the information there still gives a pretty good sense today of how to use these tools.
Will nvprof be deprecated? That's a question from the chat. nvprof is currently in what I would call, or describe as, maintenance mode, which means that we're not adding any new features, and while we're not formally calling it deprecated, it's functionally deprecated, in the sense that, unless the tool actively breaks, we're not going to be adding any new functionality to it.
nvprof already doesn't support the latest GPU architecture, so you can't use nvprof on the A100 GPUs that are on Perlmutter. So if you're still using nvprof today, I strongly recommend moving to Nsight Systems and Nsight Compute, and there's a blog post that I've linked here which tells you how you would transition from nvprof to Nsight Systems, if you want to do so. And it will be true that all future GPUs, including A100 and any future GPUs that we launch, will not have nvprof support.