From YouTube: Intro to GPU: 07 Profiling on GPU
So welcome back, everyone. I'm Max Katz. For those of you who joined just for the afternoon session, I'm the NVIDIA technical contact to NERSC, and it's my job to help train everybody on how to use GPUs effectively. In this talk, I'm going to talk a little bit about profiling tools for NVIDIA GPUs.
First, for when you're getting started on GPUs: the NVIDIA profiling tools are organized into a family called the Nsight developer tools. The way this works is that we have a set of three tools that can be used for various parts of the profiling and analysis workflow. Whenever you first start analyzing the performance of a code, the most important question is: where is the time being spent?
You want to be able to ascertain, as a function of wall time in your application, how much time is in each part of the program. That's the most important thing, because then you know what your bottleneck is — what is the most expensive part of your code — and usually that's the part you go in and optimize. You don't want to optimize the part that's the most fun to optimize, or the part that's easiest. The most bang for the buck comes from the part where you're spending 60 percent of your time.
If that's what you have, of course. That's not always the easiest thing to do; sometimes you can't, or you need to wait on some refactoring. But it's nevertheless always important to know where time is being spent, so you don't go prematurely optimizing. Before you go and optimize some code, you need to have a very clear understanding of what the possible benefit is.
If a piece of code already takes only one percent of your time, then spending your effort on that part doesn't make a lot of sense. So, with that in mind: Nsight Systems is the name of our tool that is designed to collect a timeline of the activity on your node. Nsight Systems can be used for system-wide application analysis — you're really asking, on my node, where is the time going? How much time is on the GPUs?
How much time is on the CPUs? You get a stacked-up view, as a function of time, of what's happening on your node. Then, once you've identified the particular part of your workflow that is the problem, it often breaks down into two categories. One is that you identify that memory is your most important bottleneck — in particular, copying data back and forth between the GPU and the CPU. Nsight Systems will very clearly tell you whether that's happening or not. And then sometimes the bottleneck is the actual compute workload.
I would say most of the time, the first time you port to GPUs, it's going to be memory that's your bottleneck: you either spend a lot of time allocating memory or copying it back and forth between the CPU and the GPU. But once you've gotten that all worked out, typically your compute workload will then be the most important part of your time, and you want to analyze that. Now, on NVIDIA GPUs, discrete chunks of work are called kernels.
Regardless of what programming language you use, there is a discrete chunk of work that gets launched on the GPU that has parallel work to do. In the CUDA context, that was the global function we saw earlier; in OpenACC, that would be a specific parallel region; same thing for OpenMP, within a target teams distribute construct. So, in NVIDIA terminology, those are kernels, and Nsight Compute is designed to pick out a particular kernel and analyze it. That's for when you've identified the particular loop that is the problem.
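To make the terminology concrete, here is a minimal sketch of my own (not from the talk) showing what becomes a "kernel" in CUDA versus OpenACC; the file names and the commented compile lines are assumptions.

```shell
# Write two toy source files illustrating what gets launched as a GPU kernel.
# (Illustration only; file names and compile lines are assumptions.)
cat > add_kernel.cu <<'EOF'
// In CUDA, the __global__ function is the kernel launched on the GPU.
__global__ void add(int n, const double *a, const double *b, double *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
EOF

cat > add_openacc.c <<'EOF'
/* In OpenACC, the compiler turns this parallel loop into a GPU kernel. */
void add(int n, const double *a, const double *b, double *c) {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}
EOF

# Typical compile lines (require nvcc / an OpenACC compiler such as PGI):
#   nvcc -c add_kernel.cu
#   pgcc -acc -c add_openacc.c
```

Either way, the profiler sees the same thing: one named kernel launched on the device, which is the unit Nsight Compute analyzes.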
That loop is your bottleneck, and now you want to understand: what is the bottleneck for that loop? How can I optimize that loop? The tool that is used to do that is called Nsight Compute. There's another tool called Nsight Graphics, which is for people doing graphics optimization. I'm going to assume that's not anyone in this room, but it is important for the people who do game development on the NVIDIA platform.
Of course, I'm only showing you the profiling part of the NVIDIA tool chain; Woo-Sun kindly showed you the debugging tools. There are a couple of other tools as well, but I think profiling and debugging cover the most important part of the offerings. Nsight Systems, as I said, is designed to give you a timeline view of what happened in your application.
If you look here on the right-hand side of the screen, we have a screenshot of what this might look like for a fairly complicated application. Up here at the top of the image you see workload information about the CPU. By the way, I'm going to give a live demo of this tool after these slides are done, so you don't have to squint — I'll make it a little easier to see — but I'm just giving you a sense of what you would see.
Up here at the top you have the CPU workload. In the middle you have information about the various APIs that Nsight Systems knows how to track — it can track calls into CUDA and into the CUDA libraries, and it can track MPI calls, for example. And then at the bottom you have the actual GPU workloads. That's what's down here, where you typically see red for memory activity and blue for kernels, the compute workload. That helps you identify at a glance: where's my time being spent? Is it in memory?
We can basically just collect the data and print out to standard output what happened, but the more fun thing to do is collect a report on the remote system and then visualize it in the graphical interface. I'm going to show you how to do both, but the second one gives you a much richer view: you get information like this that helps you see at a glance what's going on in your application. With some exceptions, both versions are supported on Linux, Windows, and Mac.
The exceptions: on Mac, there's only a viewer version, because you're not going to be collecting on your Mac; and on Linux POWER9, which is what's running on Summit, for example, we only have the collection mode. So you do have to copy the report back to your local system, install Nsight Systems on your own machine, and load it up there. Nsight Compute, the other half of this workflow as I described, is where you would go once you identify that a particular kernel is your bottleneck.
Now you jump into this tool, and it's going to do some analysis for you on what's going on. Under the hood, what it's doing is collecting hardware performance counters on the GPU. It runs your application, collects these counters, and prints out a report, and then you can load that report into the user interface. It'll tell you things like: am I memory bandwidth bound? Am I compute performance bound? Or something else? And then hopefully you can use that to determine where you should go next in your optimization process.
The invocation is generic — nsys profile followed by your executable, say my_app.exe — and if you add the --stats=true flag, what it does is collect a report and also, at the end, post-process it and give you a standard-output view of what happened. I'll show you what that looks like in a little bit, but basically it gives you a summary of all the activity that happened. It's not an ASCII rendering of the timeline; it's just a rolled-up summary, like the average of what each activity cost.
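As a concrete sketch of that collection step (the binary name my_app.exe is a placeholder, and the block is guarded so it's harmless on a machine without Nsight Systems):

```shell
# Collect a timeline report and print the rolled-up summary to stdout.
if command -v nsys >/dev/null 2>&1; then
  # Placeholder binary; ignore failure in this sketch.
  nsys profile --stats=true -o my_report ./my_app.exe || true
  status="attempted collection into my_report"
else
  status="nsys not on PATH; run this on a system with Nsight Systems"
fi
echo "$status"
```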
It will generate a file with the extension .qdrep — that's just the file extension, so you know which files are Nsight Systems profiles. It is possible to use Nsight Systems for interactive profiling — if you have a GPU in your local laptop or workstation, you can use it that way — but in the HPC center context, you usually collect the profile remotely and then view it locally. Then you might get a view like this.
If you zoom in even further, you can hover over a particular kernel launch and see information about it: how long it took, what kind of resources it used on the GPU, that sort of thing. That's basically the most detailed information you can get about that kernel here. If you want more detail, then you would need to jump into Nsight Compute, target that kernel by name, and profile it. So that's the workflow.
The kernel name filter matches substrings: any kernel whose name includes the string you give will be profiled, so you want to tune that carefully. There's a further option to limit the collection even more if it's taking a long time. And as with Nsight Systems, you can use Nsight Compute to actually drive the application.
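A sketch of targeting one kernel, using the modern ncu launcher (the talk-era binary was named differently); the kernel and binary names are placeholders. -k filters by name, -c caps the number of launches profiled, and -o writes a report file:

```shell
# Profile only launches whose kernel name matches "my_kernel",
# and stop after the first matching launch to keep collection fast.
if command -v ncu >/dev/null 2>&1; then
  # Placeholder binary; ignore failure in this sketch.
  ncu -k my_kernel -c 1 -o my_kernel_report ./my_app.exe || true
  status="attempted kernel profile"
else
  status="ncu not on PATH; run this on a system with Nsight Compute"
fi
echo "$status"
```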
That's if you have a local workstation — but again, most of you, if you're running at NERSC, will be collecting on the command line remotely and then visualizing locally. By the way, these are generic slides, not specific to NERSC. So if you're running at NERSC, you need to launch under srun -n 1, because, as we talked about, the GPUs aren't visible from the allocated node otherwise.
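On a Slurm system like the one described, the collection step would be wrapped in srun, roughly as follows (the executable name is a placeholder, and the block is guarded for machines without Slurm):

```shell
# Launch one task on the allocated node so the profiler can see the GPU.
if command -v srun >/dev/null 2>&1; then
  # Placeholder binary; ignore failure in this sketch.
  srun -n 1 nsys profile --stats=true ./miniweather || true
  launched="yes"
else
  launched="no (srun not available outside a Slurm system)"
fi
echo "launched: $launched"
```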
And this is what the Nsight Compute interface looks like — again, I'll show you this in a moment — but it's broken down into a set of sections, and each section is intended to give you a different level of analysis on different parts of the GPU workload. The top-level section is probably the most important one, the one you'd always start with: it's called the Speed of Light section. The Speed of Light section is intended to tell you, of the various theoretical bottlenecks on the hardware,
how close you are getting to those theoretical limits. We publish that if you're using double-precision floating-point multiply-adds, then the peak performance of the Volta GPU is something on the order of seven teraflops. So this would tell you, if you had an application that was primarily doing that instruction,
what percentage of that seven teraflops your kernel is actually getting. That gives you a sense of how much more optimization you need to do before you hit a hard limit on the GPU, beyond which you could not get any faster and should move on to work on another kernel. That's what the Speed of Light section is for: the "speed of light" is the peak possible performance, and the percentage is how much of it you achieved.
There are several different limits to consider; it gets a little more complicated because there are different levels of memory. As we discussed, there's what's called global memory, which is basically your DRAM — the big chunk of memory the GPU has, 16 gigs — and there are also levels of cache, and any one of those could be your bottleneck.
If you're doing a streaming operation, like summing two long arrays together, that typically will be bottlenecked by how long it takes to go to DRAM, because the size of the arrays will be much larger than the size of your cache. But if you're doing something that can fit in cache, then you will primarily be bottlenecked by the bandwidth to the cache, or the latency to the cache.
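To see why the array-sum example is DRAM bound, here is a back-of-envelope estimate; the 900 GB/s and 7 TFLOP/s figures are approximate V100-class numbers I'm assuming for illustration, not quoted from the talk.

```shell
# Summing two double arrays: per element we read 16 bytes, write 8, do 1 flop.
bytes_per_flop=24
peak_bw_gbs=900        # assumed HBM2 bandwidth, GB/s
peak_fp64_gflops=7000  # assumed FP64 peak, GFLOP/s
# Flop rate achievable if limited purely by DRAM bandwidth:
stream_gflops=$(awk -v bw="$peak_bw_gbs" -v bpf="$bytes_per_flop" \
  'BEGIN { printf "%.1f", bw / bpf }')
echo "bandwidth-limited rate: ${stream_gflops} GFLOP/s vs ${peak_fp64_gflops} GFLOP/s peak"
```

The bandwidth-limited rate comes out two orders of magnitude below the compute peak, which is why no amount of tuning makes a pure streaming kernel compute bound.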
The memory hierarchy has different levels, and so it shows you what percentage of each level you're at. In this screenshot example — the nomenclature is a little unfortunate, or hard to parse for beginners — there's basically an entry for L1 cache, L2 cache, and DRAM: what percentage of each of those did you achieve? And approximately speaking, your total limiter will be essentially the highest of those three, because the highest number you get is the bottleneck.
Okay, so as a general rule — and this is getting more into the weeds; perhaps you'll need to build up some more experience with this — a kernel that looks like this is not an efficient use of the GPU's resources. It's not getting near the hundred percent that you'd like to get for either compute, which is the upper bar graph, or memory, which is the lower bar graph.
The question was: should I interpret this as saying that the top row should be as large as possible relative to the bottom row — should I try to be bottlenecked by compute rather than memory? My comment is: that's the ideal world, where you're bounded by compute. And if you are bounded by compute, often you will not be bound by memory, because you're not making that many accesses to memory — you have a lot of compute instructions to do.
A
That's
the
ideal
world,
but
don't
necessarily
make
that
a
goal
of
your
optimization,
because
if
you
are
trying
to
add
two
rays
together,
the
example
I
gave
you
can
never
make
that
compute
bounds
right.
The
best
you
can
do
is
make
it
memory
bandwidth
bound
and
then
just
optimize
that
as
much
as
you
can
by
doing
things
like
making
sure
the
memory
accesses
are
coalesced
and
that
you're
achieving
as
close
to
100%
as
possible.
Okay, so how much time do I have left — twelve minutes? That's fine. I'm going to give you a quick demo. I don't have enough time to do this full justice, but I'll give you a flavor, and this is something you might be able to do for homework — maybe later, not necessarily today; it's a little bit more advanced.
But if you want a follow-up exercise, you can do that. Oh — until 2:30, okay. So, Matt Norman is a climate scientist at Oak Ridge who has developed a mini-app called miniWeather. miniWeather is a simple C- or Fortran-based CFD-type code that does basically the type of analysis you would do for weather or climate simulations. He kindly made it available to the community as an open-source code.
It's a nice mini-app to play with if you want to get familiar with what that community's code looks like, and it's the type of thing that vendors would typically go off and try to optimize for procurements. miniWeather is an excellent code for doing various types of analysis, and I have chosen to fork it and make a version of it
that I use for demonstrating performance analysis. This is on my personal GitHub — it's in a branch called mkatz/tutorial — so follow up with me afterwards, or I can make it available in the NERSC documentation, something like that. But this is something you can go get from GitHub, and it gives you a few simple problem exercises that you can work on. And as you can see, I was working on this frantically while Woo-Sun was talking.
It gives you some hints along the way, and then eventually you get to the point where the kernels are your limiter, and you use Nsight Compute to profile a kernel. I'm not going to go through all five of those problems — I don't think we have time for that, and I also want to leave you something to do — but I'll give you a sense of how that workflow works, and I'll show you how to use both tools. So, I have cloned this repository on Cori.
I have currently allocated a node on Cori with salloc, the way we described before. You can see, if I do this, I get the GPU that's available to me. I've requested one GPU, and that's all I need for this. miniWeather itself is an MPI code; I stripped out all the MPI from my version of it, just for simplicity, so you can focus on just the single-GPU performance.
If I ls here, this is the structure: I have five problems, which I just keep in text files, and for each problem I've provided a git patch file. So if you get stuck and you just want to see what my solution to that problem was, you can just do git apply on the solution patch, and that will basically solve it for you, so you can then move on to the tool part of the analysis.
In order to build this, you just run make. It requires having the PGI compiler in your environment — this is an OpenACC-accelerated code; I basically took Matt Norman's code and stripped out everything but the C OpenACC implementation — so it needs a PGI compiler. On Cori, you just do module load pgi. And again, if you're confused, ask me afterwards and I can run you through the workflow. So now I have a miniweather executable, and I can just run it with srun.
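The build-and-run sequence just described, as a guarded sketch (the module name and executable path assume the Cori setup from the talk):

```shell
# Only meaningful on an HPC system with environment modules and Slurm.
if command -v module >/dev/null 2>&1 && command -v srun >/dev/null 2>&1; then
  module load pgi          # PGI compiler needed for the OpenACC build
  make || true             # assumes the miniWeather fork's Makefile is present
  srun -n 1 ./miniweather || true
  built="attempted build and run"
else
  built="skipped (not on an HPC system with modules/Slurm)"
fi
echo "$built"
```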
Basically, nvtop — from the 30 seconds that I've looked at it — is a tool that shows you a time-graph representation of the utilization of the GPU. And if you run nvidia-smi, you get a number of fields here; it basically gives us a status report on the GPU. I know this is a diversion, but it's a useful thing: the status report tells you things like the temperature of the GPU,
its total potential power draw — 300 watts for a Volta V100 — and the current power draw, so it's basically idling at this point. This is the total amount of memory available, this is the current amount of memory allocated, and then this number here is the GPU utilization. It's a very crude measure of how heavily you're hitting the GPU: it says, basically, over my last one second of wall time, what fraction of the time was something running on the GPU. So this is 100%.
That basically says you're running a continuous workload on the GPU. You can often use that for your highest-level questions, like: am I even running on the GPU at all? That's a useful thing to check. And if I am using the GPU, am I only spending 2% of the time on the GPU, with the rest presumably on the CPU?
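For scripted checks of the utilization number just described, nvidia-smi also has a query mode (guarded sketch; the query fields shown are standard nvidia-smi options):

```shell
if command -v nvidia-smi >/dev/null 2>&1; then
  # Machine-readable utilization and memory, one CSV line per GPU.
  nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
  queried="yes"
else
  queried="no (nvidia-smi not present on this machine)"
fi
echo "queried: $queried"
```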
nvidia-smi has a couple of different modes. One of them is the daemon-style monitoring mode, where basically every second it prints out a bunch of stats like this, and I believe nvtop is essentially a wrapper that runs it in that mode and shows it in a nice graphical interface. So, if you want, you can load this up in a separate terminal window — or you can be even simpler about it with plain Linux.
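Two generic ways to get the rolling view just described (both assume the NVIDIA driver tools are installed; the block is guarded so it's harmless elsewhere):

```shell
if command -v nvidia-smi >/dev/null 2>&1; then
  # Option 1: the device-monitoring mode, one stats line per second.
  nvidia-smi dmon -c 3          # -c 3: stop after three samples
  # Option 2: "be simple with Linux" - re-run the normal status report:
  #   watch -n 1 nvidia-smi
  monitored="yes"
else
  monitored="no (nvidia-smi not present on this machine)"
fi
echo "monitored: $monitored"
```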
So that's one thing you can do. I don't know exactly how that works at NERSC, but in general it's something that can be done at HPC centers, and you typically don't want to be doing it that often, because it's just more load to deal with. But it is a way to get around the fact that, with Slurm, you only have one srun task going at a time — you can't srun your program and then also srun nvidia-smi at the same time.
So that's a trick I use to get around it. If that's not currently documented in the Cori docs, we'll make sure we describe that workflow in case you're curious. Okay, so that was an aside on nvtop and performance monitoring. But that's a very crude metric: it only tells you whether the GPU was active, essentially. We want better information — we want to know what actually happened on the GPU. So I ran miniWeather, and it gives me a counter of where I am in time.
It tells me what the time step is — this is a CFD grid code, so it advances a particular time step — and the first two numbers, nx_glob and nz_glob, are the number of zones. This is a two-dimensional grid code, so it's x zones by z zones, and there's a total of 800 zones in the current version of the code in this simulation. So what I'm going to do is run nsys profile on this application — nsys being the binary for Nsight Systems, as we discussed.
That will do two things for me. One, it will capture to disk a report file, which is an opaque binary thing recording all the events that Nsight Systems captured. And two, it will post-process that report and give me a list of everything that happened.
This can be a little verbose — you might want to pipe it to a file and then look at it — but basically it gives you several different sections on memory operations and kernels. Remember, kernels are the compute workloads. If I scroll up to the top, what I see are these sections, and the first section is the CUDA API.
The CUDA API is what the CPU calls to launch work or do things on the GPU, and it's broken down into memory allocations, memory transfers, and then what we call kernel launches, which actually launch the work on the GPU. It's ordered in descending order by amount of time. So what this row is telling me is that of the CPU calls into CUDA — not of the entire application run time, just of the parts of the CUDA API that were tracked —
98% was in this API call, cuMemHostAlloc, which I don't expect you to recognize, but which is basically memory allocation. This is the symbol that gets called when you do a cudaMalloc, or something like an OpenACC acc data create, or an OpenMP map — it's the memory allocator underneath. So, in terms of the CPU's view, this run is dominated by memory allocation. For kernels, it gives you a descending list of all the kernels that ran on the GPU.
OpenACC and OpenMP have a nice property: they tend to generate nice-looking kernel names that are the name of the function in the code — in this case set_halo_values_z — plus a line number in the code. That's super useful for locating where that loop was. If you're doing heavily templated C++ code,
the names are not as fun. Basically this tells you that, of the time spent running compute work on the GPU, about half was spent in this set_halo_values function. It tells you how much time in nanoseconds was spent there, how many times it was called, and the average, min, and max for that kernel.
These kernel percentages are of the kernel total, so if you sum them up you'll get a hundred percent. And then finally, down here, you have memory operations — these are transfers between the CPU and the GPU — and basically it tells you that about half of that time was spent in CPU-to-GPU transfers, which we call host to device or HtoD, and about half in the other direction. Now, if you added up these various sections in nanoseconds, you could make some inferences.
So I have a second terminal window open here, and what I'm going to do is use the workflow of scp-ing the file down to my laptop and then viewing it in my local viewer — a pretty common workflow. If I print my working directory, this is where I am on Cori, so I'm going to copy that path and then copy this file down to my laptop, and hopefully I will type things correctly.
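The copy-down step, sketched with placeholder user, host, and paths (adjust all three for your own account):

```shell
# Pull the report from the remote system to the local machine, then open it
# in the local Nsight Systems GUI (File > Open). All names are placeholders.
remote="user@cori.nersc.gov"
remote_path="/path/to/workdir/my_report.qdrep"
copy_cmd="scp ${remote}:${remote_path} ."
echo "$copy_cmd"
```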
The question was: does Nsight Systems replace nvprof? For those of you who have done GPU programming on NVIDIA hardware before, there is an existing profiling tool called nvprof and a paired graphical interface called the NVIDIA Visual Profiler, NVVP. The answer is yes: that tool is now in what I would call maintenance mode. We've essentially stopped active feature development as of the current Volta GPUs, and on the next-generation GPUs that will be in Perlmutter, Nsight will be the only supported profiling tool.
So that's why I'm showing you this tool. The old profiling tool works on the Cori GPUs — nvprof works, there are no bugs that I know about, and you're welcome to use it — but I'm intentionally showing you the tools that will be supported on Perlmutter. So I open up this .qdrep report file, and I get a timeline view like the following.
It's broken down into sections, as I said earlier. The top part here is the CPU workload: this black bar is a measure of load on the CPU. If the bar is at 100%, that basically means I'm hitting a CPU thread heavily, whereas here towards the beginning I'm just sort of starting to spin up my application.
The CUDA API is on this row, showing all the calls into CUDA. There's a big chunk here for cuMemHostAlloc, which lines up with what we saw before: of the time spent in the CUDA API, most of it is in that memory allocation. But now we can see that this is a pretty darn big chunk of the application run time as a whole.
The entire run time goes from the left bar here to the bar on the right, and it's 800 milliseconds. Hopefully the screen will come back up — I'm not sure, but I hope; I'm not connected to it directly, I'm just going through Zoom. Okay, sorry about that — for those on WebEx, we were just having screen-sharing difficulties.
So, of that 800 milliseconds, you can see a big chunk — I can do something like this to highlight a section and look at the timeline — and I see that it's about 300 milliseconds, almost half of my run time, spent in this memory allocation. Then the second part down here, the CUDA row, is where the actual GPU activity occurs. I can expand it, and it's broken down into kernels, which is the compute, and memory, which is the memory transfers.
And if you have sharp eyes, you can see that all of the compute workload is right there, in that one little part of the application — and that's pretty tiny. That means this application is not spending much time on the GPU, and the main message I would take away from looking at this profile — and this is basically exercise one in my tutorial — is that this is not a workload that works well on the GPU. Even if every one of those kernels is fully optimized,
even if it cannot be better — you've written the best possible CUDA — it still does not make sense to run this on the GPU. This workload will be faster on the CPU, for sure, and it comes down to two things. One: it takes time to spin up CUDA, or the GPU in general — something like half a second to one second to actually get everything loaded onto the GPU.
Two: memory allocation is really expensive on GPUs. There are fundamental hardware reasons why that's true, and so it is much more expensive to allocate memory on GPUs than you're familiar with on CPUs. So if you're dwarfed by the amount of time it takes to allocate the memory, and you only spend a tiny amount of time computing with it, that's not a good use of your time. So think carefully for a second: what is the answer, then? How do I make this application run well on GPUs?
Running on the CPU, right — and that is the answer for this workload; this would be faster on the CPU. But the trick answer to that question is: make the problem bigger. This is my fundamental argument, that you should not — cannot — run small workloads on the GPU. In almost every part of science that I can think of, there is a way to make your problem more expensive by giving it more work and having it be higher fidelity. If it's a grid code, you give it more zones.
That's more work, but it's usually a higher-resolution, higher-fidelity computation. If you're doing molecular dynamics, you add more atoms. There are all sorts of things you can typically do to add more work, at the cost of making the computation more expensive, but also getting higher accuracy and higher fidelity. And that is what you do for GPUs: you do not run this version of the problem.
So the question is: if the number of GPU cores stays fixed, how can adding more work make it any faster? It's a great question, and the answer is basically this: I told you before that there are 800 zones to work on, so that's essentially 800 degrees of freedom in the application, and the fact is that NVIDIA GPUs can have a hundred thousand threads resident at one time. So we're not actually using all the cores right now. That's how we make this faster.
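The arithmetic behind that statement, using the numbers from the talk:

```shell
zones=800         # degrees of freedom in this miniWeather run (from the talk)
resident=100000   # threads an NVIDIA GPU can keep resident (order of magnitude)
# Fraction of the GPU's resident-thread capacity this problem can occupy:
pct=$(awk -v z="$zones" -v r="$resident" 'BEGIN { printf "%.1f", 100 * z / r }')
echo "occupied fraction: ${pct}%"
```

With under one percent of the potential threads active, the GPU has almost nothing with which to hide latency, which is exactly what the Speed of Light numbers later in the demo show.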
The rule of thumb is: if you have a hundred thousand to a million degrees of freedom in your application, you stand a good chance of saturating the GPU's compute capability. For anything below that, you're not using the GPU well — get up and go home. Well, don't go home; make your problem bigger.
Right, absolutely — the comment was that you can run into memory issues, and that's right: you might make a problem so big that it no longer even fits on the GPU. I agree that can happen, but for something around a hundred thousand degrees of freedom, you can usually fit that into memory, kind of generically, across all of the science applications I know.
A
Okay
right
so
so
the
question
was:
does
has
this
actually
been
tested
in
ask
I'm
promising
that
it
is
because
I
am
literally
doing
this
at
nurse
right
now
on
the
core
GP,
you
know
it
so
I
promise
it
works.
That
said,
there
is
an
issue
where
this
stats
equals.
True
option
was
added
in,
like
somewhat
recent
version
of
n
site
systems
and
the
CUDA
module
determines,
which
version
you
get.
So
if
I
do
module
list,
you
see
I
have
cuda
10.2
to
89.
A
A
A
For those following on the WebEx — on the Zoom, sorry — I am hovering over a kernel, and it gives me information. I highlight one somewhat at random, and I'm going to say that's the one that now takes the most time. In this case I happened to highlight a compute_tendencies_x kernel, so let's pretend that compute_tendencies_x is the most important kernel now that you've done your refactoring.
You can see this is already taking a lot longer than it used to, because each kernel is being run 17 times — that's the number on the right — so every instance of that kernel is being profiled, 17 times, and it's actually going to take a little bit longer. And it happens to be the case that there are actually two functions with compute_tendencies_x in their names, so it profiled both of them.
So, to make life simpler — and because I have one minute left — I'm going to intentionally profile only one kernel. The -c 1 option means profile only one instance of that kernel, and then I'm going to store this in a file, so I'll call it miniweather. You now see that instead of giving me standard output, it has created a file, and it has this extension — which, again, is a mouthful, sorry. I'm going to copy that down — and I'm going to go a couple of minutes over; I hope
nobody gets mad at me, just a couple of minutes. I copy that down, and now I have my report file. I open up the Nsight Compute user interface, which I've already pre-opened — it looks like this. You go to File, Open File (don't get confused by Open Project), and open my miniweather .nsight-cuprof-report file. And so finally, this is the view that I showed you a screenshot of half an hour ago. It gives me several sections with many different levels of analysis, and I'm just going to stay on this one.
For the sake of time, this is the GPU Speed of Light section that I showed you, and basically, if I look at these bar graphs, they tell me that I am using one percent of my peak memory bandwidth and one percent of my peak compute. So is that a good use of my GPU? No — that's not a great use of the GPU.
A
This is a latency-bound kernel. Latency bound is generally when you're not memory bandwidth bound and you're not compute bound; you're bound by latency. Remember I said that GPUs are latency-hiding processors: any one operation is very high latency, hundreds of cycles to go to global memory, to DRAM, but we can hide that by having lots of work ready to go, so that at any one clock cycle, the work that is ready to go can go.
A
This thing only has 800 zones of work to do in this problem, but we have a hundred thousand threads that we could be running, so like 0.1 percent... no, one percent of my potential threads are active. And so it kind of makes sense that I'm only getting something like one percent of the peak compute or memory bandwidth, right? I can never achieve the peak compute performance, because that peak performance relies on having enough threads going to hide latency.
A
And so when you see something like this, you know you're latency bound, and in this particular case the answer is: add more work, right? Make the grid bigger. That's the answer for this problem, but sometimes there are other limiters, and the tool will, I think, help guide you through that process. So that's the very high-level overview. I just want to let you know that these tools exist and that you should make them part of your workflow.
A
The most important thing to do when you start running on the GPU is profile, right? In fact, before you even get on the GPU, you should probably use Nsight Systems to collect a profile of your application and have that be the baseline, so that when you start putting things on the GPU, you can see: did it get faster or slower? And I'm going to warn you: the first time, it's going to get slower, right? Because you allocated memory
A
that took too long, or you're doing too much memory transfer, right? So don't get bummed; that's a normal part of the process. Profile it, see what your bottleneck is, and then work to eliminate that, whether by exposing more parallelism, or doing things like using memory more effectively, or putting more work in a row so that memory doesn't have to transfer back and forth.
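The baseline-then-compare workflow can be sketched with Nsight Systems; the executable name is a placeholder, and older toolkit versions write `.qdrep` reports rather than `.nsys-rep`.

```shell
# Collect a timeline before any GPU work exists, so you have a
# CPU-only baseline to compare each porting step against.
nsys profile -o baseline ./myapp

# Summarize the report on the command line: time per function,
# per kernel, and per memory transfer.
nsys stats baseline.nsys-rep
```
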
C
A
I will say that there's a light at the end of the tunnel for that, which we can discuss offline, but for now it's basically whatever name the compiler generates, and as long as you can find a substring of that, you can give it to -k. But it may be a pretty gnarly name. Kokkos is famous for generating kernel names that are literally 2000 characters long and, like, breaking some tools.
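For example, Nsight Compute's `-k` flag accepts a kernel name or, with the `regex:` prefix, a pattern, which helps when the mangled name is enormous; the substring and binary name here are only illustrative.

```shell
# Match any kernel whose (possibly huge, mangled) name contains the
# substring "tendencies"; quoting protects the pattern from the shell.
ncu -k "regex:tendencies" -c 1 -o report ./myapp
```
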
B
A
So the question was: do the Nsight tools work on other platforms, like Python, for example? The answer is yes. The nice thing about the NVIDIA platform is that all of the programming models go through the same underlying level of CUDA, and so everything generates CUDA kernels when it does work, right? A CUDA kernel can be generated from CUDA C, or it can be generated from CUDA Python or Numba or CuPy, and so every tool can profile every program; anything that runs on NVIDIA GPUs can be profiled this way.
A
That said, not all of the integration will be the same, so it may be easier to correlate lines of source code from C in the profiler than it would be from Python. So that part is different, maybe, but the underlying concept of being able to look at a timeline view and target particular kernels with Nsight Compute can be done independent of programming language. Yes.
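Since everything bottoms out in CUDA kernels, the collectors can wrap a Python interpreter just like a compiled binary; the script and kernel names here are made up for illustration.

```shell
# Timeline of a Python program that launches CUDA kernels through
# Numba, CuPy, or similar.
nsys profile -o numba_run python my_script.py

# Drill into one kernel instance from that same program with
# Nsight Compute.
ncu -c 1 -k my_kernel python my_script.py
```
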
C
A
But that is a free program. It's just that somebody on our marketing team decided that they wanted to collect that information, so you do have to join the developer program to do it. The other thing is: we have Nsight Systems installed on Cori, so if you had VNC or X forwarding, you could go that path by just loading the user interface remotely. If you're very close to the system, that can be okay, can be tolerable.
A
If you're halfway across the country, the X forwarding is pretty rough for these applications, and I'd recommend just biting the bullet: scp-ing it down to your system, registering for the relevant program, downloading the tool, and then running it. It's a free download; you just need to register an account. That's right!
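The bite-the-bullet workflow can be sketched as follows; the hostname and paths are placeholders, and the report extension depends on your toolkit version.

```shell
# Generate the report on the remote GPU system, then pull it down and
# open it in a locally installed copy of the Nsight Compute GUI.
scp user@remote-gpu-system:runs/miniweather.ncu-rep .
ncu-ui miniweather.ncu-rep
```
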