From YouTube: Roofline Hackathon 2020 part 3
Description
Demo of Nsight Compute using toy kernels and real HPC codes

This effort of doing roofline analysis in Nsight Compute, which I'll talk to you about for the next hour and a half or so, is largely the result of a co-design effort between DOE and NVIDIA — specifically a request from LBL, and in particular from Sam and Charlene, who are helping run the session today and who pushed for us to add this functionality to Nsight Compute. So I'm happy to report that, as of the CUDA 11 release of Nsight Compute, which is version 2020.1, we are now able to do roofline analysis in Nsight Compute.

Before I talk about that, I want to give you a brief overview of NVIDIA's developer tools for profiling, to give you a sense of the landscape: what tools we have now and how roofline analysis might fit into that.

The set of profiling tools that NVIDIA provides goes under the Nsight product family name. In particular, we're going to focus on Nsight Systems and Nsight Compute. I'll emphasize that these are not the only developer tools that NVIDIA provides. There are also, for example, debugging tools like cuda-gdb, which is a CUDA extension of GDB and can be used for debugging applications that run on NVIDIA GPUs, as well as cuda-memcheck and the new Compute Sanitizer, which are roughly analogous to what you might use Valgrind for on a CPU application. We also work closely with the third-party tools ecosystem, so tools like HPCToolkit, TotalView, that sort of thing — Vampir and Score-P — know how to talk to NVIDIA GPUs and are supported on our platform.
The Nsight product family looks like this. Typically you would start with Nsight Systems to get a comprehensive, application-level view of what happened when you ran your code. It's collecting information on both the CPU and the GPU, and it's really telling you things about when and where you had GPU workload on your system and when and where you had CPU workload on your system.

It tells you when your kernels are running — kernels are just the name for the discrete units of work that happen on the GPU, regardless of which programming language you use — and Nsight Systems helps identify where those are. Generally speaking, you use it to get a high-level view of the performance of your application and understand: am I using the GPU effectively at all? You can only be using the GPU effectively if the bulk of your runtime, in some sense, is spent on the GPU.

So if you have a GPU-accelerated application and only five percent of the runtime is happening on the GPU, this suggests that you're probably not using a GPU compute node effectively. Nsight Systems should really be used to answer that question first — namely, what percentage of the time am I actually spending on the GPU — and you really want to maximize that, or at least get it pretty substantial, so that you know you're using your GPU effectively.

Only then, when you have determined that a particular kernel or set of kernels is dominating the runtime of your application, should you attempt to optimize those kernels. This talk is mostly going to focus on that process of diving into a particular kernel and analyzing its performance, but I just wanted to start by emphasizing that in some sense this is not the first step of the process.
Typically, in most cases your workload will be more complex and it's not just a single kernel that dominates the runtime; you want to get to that place, but you may not start there. So in this workflow you typically start with Nsight Systems, identify your particular kernel or set of kernels, and then analyze those kernels with Nsight Compute.

If you are currently a user of nvprof and its user interface, the NVIDIA Visual Profiler (NVVP), we're generally encouraging you to switch to these new tools, Nsight Systems and Nsight Compute. nvprof and NVVP are in maintenance mode, so we are fixing bugs as we find them, but we are not adding new features. All new profiling development is going into the new Nsight Systems and Nsight Compute tools, and in particular these will be the only way to profile on Perlmutter, so it's definitely worth your time to learn how to use them.

You just do nsys profile and then the name of your application, and if you add --stats=true, that gives you a summary output to standard out which lists the kernels that ran as well as the other operations that occurred. That's pretty similar to what you would have gotten if you used the nvprof command line with no arguments. I'm going to use just the command line interface today, because the sample codes we're going to be working with are very simple and only have one or a few kernels, so jumping into the UI won't tell us much more than we can see from the command line output — but for a real production science workload the UI timeline becomes much more informative.
In the timeline you see the workload on the CPU threads, as well as calls into the CUDA runtime API. The CUDA runtime API is typically what gets called into regardless of your programming model — whether you use OpenMP offload or OpenACC, or a higher-level approach like Kokkos or RAJA or Thrust, they're typically calling into the CUDA runtime API to actually launch work on the GPU. In the bottom half of the plot you see information about the kernels that ran on the GPU, as well as memory operations. The kernels listed here are in blue and the memory operations are listed in red.

Okay, so I'm going to jump right into Nsight Compute, which is our kernel profiling tool, and tell you a little bit about how it works. Nsight Compute is designed to give you different views into different aspects of the performance of your application, and it's presented in the form of several sections, each of which tells you something about the performance of your application — in particular, a particular kernel from that application.

The first section, which is the one you typically start out with, is the GPU Speed of Light section, which tells you what percentage of peak you're getting for both compute (the upper bar) and memory bandwidth (the lower bar). I'll go through what these mean in more detail in the hands-on exercises. But then we have several sections that follow.
One of the sections will be the roofline analysis section that I will show you during my demo, but we also have other sections like Compute Workload Analysis, Memory Workload Analysis, that sort of thing. Again, I walk through this UI in detail in my walkthrough, so I'm not expecting you to understand all of this now; I just want to give you a sense of what you're getting.

Nsight Compute has both a GUI and a command line interface, and it's pretty customizable. In fact, one of the things we'll see today is that you can customize it pretty heavily to do the analysis that you want to do. One example that I will talk about is that you can actually create your own roofline chart. So if you wanted to add some roofline analysis that we don't provide for you, it's actually fairly straightforward to do; the main challenge is understanding what hardware counters you would need in order to provide the information you're looking for.

Nsight Compute has a Memory Workload Analysis section that allows you to see the flow of memory traffic through both the physical memory spaces — L1 cache, L2 cache, and device memory — as well as logical memory spaces like global and local memory. It also has other things like compute and instruction workload analysis, which we'll take a look at. Nsight Compute also has the capability to create what's called a baseline, to compare multiple versions of a kernel.
For example, you profile a kernel and then you make some tweak to it, and you want to see: did my tweak make the performance better? You would load that report in, create a baseline from the original version, and then get two bars so you can see both the current run and the baseline, and that tells you whether your performance got better or worse. You can also, of course, do that with multiple invocations of the same kernel in the same application, in case you want to check whether the performance of that kernel varies as a function of time in your application.

Nsight Compute also allows you to do correlation between the assembly instructions and your lines of source. The way Nsight Compute typically works under the hood is that it's collecting samples of hardware counters at each instruction in your assembly code, gathering information about the different things happening at that particular assembly instruction — how much time was spent there, how many floating point operations, how many memory operations are occurring at each instruction — and you can, if you want, correlate that back to the source code that you actually wrote, whether it be in C or in Fortran or some other language.

One thing that I'll emphasize, though, is that it can be pretty tricky to use this correctly. The fact that a particular line of code has the most samples associated with it doesn't necessarily mean, naively, that it's the most expensive line in your application. Understanding that really entails getting a more thorough understanding of the fact that GPUs are running many instructions simultaneously, so one must interpret this with caution, and in practice it requires some experience to interpret it.
Today I'm only going to use the command line interface to drive the application, and then, when I want to load the results into the user interface, I can just save them to a file that Nsight Compute knows how to interrogate and display the results from. But you can also just print some results to standard out if you want to, and this is an example of what that might look like. It has many of the same fields — for example, the Speed of Light metrics, which give you a percent of peak, are the same numbers you would see in the bar charts that I showed before.

If you want to profile a kernel with Nsight Compute, you just use the command line interface, named ncu — that's the command-line executable that does the profiling. The name was a little bit different in previous versions of Nsight Compute: it was nv-nsight-cu-cli, which is both a mouthful and hard to type. So we've shortened it and hopefully made it a little nicer for you, and this is available as of the most recent release of Nsight Compute, 2020.1, which is available on Cori. You just do ncu and the name of your application. If you do that with no arguments, it profiles every kernel in your application — and because of the nature of GPUs, we cannot collect an arbitrary number of hardware counters at every invocation of a kernel.
So if you want to get a relatively detailed view of a kernel, Nsight Compute needs to re-run your kernel multiple times in order to understand its performance by collecting all the counters you asked for. Implicitly, that can make your application take a very long time — potentially orders of magnitude longer than when it's not being profiled.

So in production applications it's typically recommended to narrow down your search a little bit, either by specifying the particular kernel name that you're looking for (that's with the -k option) or by profiling only a certain subset of the invocations — for example, only profiling one or a few invocations of the kernel and leaving the rest unprofiled. There's another command line option to do that as well, which I could talk about.

But, as I said, you can also use the UI for driving the application. I won't do that today, mostly because in a typical HPC cluster environment it usually makes more sense to drive the application with the command line interface, save the file, and then view it offline in the user interface. But if you were developing on your local workstation, you could use the UI for that.

Before the exercises, there's a question in the Slack: do we need the Nsight that comes with CUDA 11?
Basically, the answer is yes. Let me put it this way: the version of Nsight Compute that you use to collect the data should be consistent with the version of Nsight Compute that you use to view the data. If you have a version mismatch, it's possible that it will still work — in particular, it usually works for a newer version of the UI to load an older version of the report, but it often does not work the reverse way.

I will also note, in response to the specific question, that you do not need to install the NVIDIA driver, or even the CUDA toolkit as a whole, to install Nsight Compute. Nsight Compute is available as a standalone installer. If you just Google for Nsight Compute — and I can give an example of the installer page here — you can see that if you click this download button, it'll take you to a download portal in the NVIDIA Developer Zone, which allows you to download Nsight Compute as a standalone tool rather than having to download the entire toolkit, and that will be a new enough version to support the analysis that we're doing today. I'll also point out that this requires an individual developer login, so you need to be willing to create an account to go this route, but hopefully it won't take that long.
Okay, so I'm going to jump right into some examples now. I'm going to encourage you to actually walk through these with me. If you want, you can just watch, but I think you will gain a lot more from this exercise if you attempt to do it yourself, and for that reason I will go through this relatively slowly so you have the opportunity to follow along if you want to.

The first thing you should do is clone the roofline-on-nvidia-gpus repository. This is the repository that Charlene was showing earlier; it's on GitLab, and I will just copy it into the Zoom chat — and if somebody could copy it into the Slack chat as well, that'd be appreciated. You'll want to go ahead and git clone this repo, so I'll give an example: I can just do git clone and then that git repo URL, and I'm going to recommend that you clone it at NERSC, because NERSC is where we're actually going to be collecting the data. So I'm going to go ahead and do what I just recommended and git clone this repository.

This will take a little bit of time to clone, because the example code we're going to use has a relatively large input file. We're going to fix that in the future, but for now it's kind of a large download, so I apologize for that. You'll see it takes a few seconds, and you can see I'm doing this actually at NERSC.
Note that this CUDA module is not the default CUDA module — it's one version newer than the default, which is 10.2.89 — so go ahead and do module load for the CUDA 11 module (cuda/11.0.167) explicitly so you get that. There is also an nsight-compute module, which is decoupled from the CUDA toolkit. I'm not going to go through that today, but that module always has the latest version of Nsight Compute; it just happens to be that right now these two are the same thing. Nsight Compute does release more frequently than the CUDA toolkit, so if you want the latest and greatest you can always load that module explicitly.

In the repo there's the actual application source code that we're going to look at today; there's also an ancillary file which loads the input data and sets up all the arrays that we're going to work with; and there's the actual input data file, which is that large thing I was talking about. There's also a README which goes through a description of what's actually in here. I'm going to walk through this with you, so you don't have to read all of it now, but if you want to refer to it offline, you can just look at this relatively detailed README to understand what's going on with the files we provided here — in particular the scripts that are being used for collecting profiling data. There's also a Makefile which is used for compiling the code I'm going to work with today. If I inspect the Makefile, you'll see that I'm using PGI to compile OpenACC code — this is Fortran code and we're using OpenACC as the parallelism model for the main exercise.
One of the nice things about using Nsight Systems and Nsight Compute, which I'll go through in a little bit, is that they're pretty agnostic to the programming model: anything that is capable of generating NVIDIA CUDA code under the hood — which is basically what this is doing — can be used with Nsight Compute, and so that's totally sufficient for what we're going to do today. I guess the corollary is that none of the principles I want to talk about are specific to OpenACC, and I'm mostly going to ignore OpenACC as a language when I look at Nsight Compute. But I am going to focus on OpenACC again when I talk about the optimization of this particular kernel, just because we do need to think a little bit about the parallelism in order to think effectively about how to use GPUs.

Okay, so what I'm going to do is cd into the tutorial directory for a second — this is the set of files that we're really going to work with today — and if I ls it, you can see it's taking a long time. I think that's either because the Cori file system is being a mess, or because my script that does some things under the hood as part of my bash profile is taking a long time; I'm not sure which it is.

What I've done is I've created a README that describes the tutorial we're going to work with today, and I will explain what these files are, but we can also just look at the README in our browser. So if you go to this tutorial directory and then look at the README, it talks about what we're going to do today. I'm going to actually go through this live, but if you get lost, or you want to refer to it later, the README helps describe, in words,
the exercises I'm going to go through today.

The first thing we're going to do is look at a very simple tutorial code. Before we get to the more complex GPP science example that we prepared for today, I'm going to go through some very simple CUDA C kernels, which both give us some practice actually using Nsight Compute to collect data and help us understand whether the roofline analysis and the other parts of the profile that we collect jive with our intuition about how these individual kernels should work.

So I'm going to go ahead and open up this file in my text editor, and what I'm going to look at is three kernels that are part of this simple CUDA C application.
Let's look at what kernel A does — just this main part of the work here, which is the thing we're going to focus on. Kernel A takes a simple 1D array that we're calling a — an array of doubles — and creates a simple local variable d, which is the result of adding a particular element of a to it 100 times... well, rather, we're unrolling it a hundred times, and the number of times we add is determined by this parameter M, which in this particular application we're passing in as 10,000. So we're going to add the same number 10,000 times into this local variable d, and then store the result back into that same array index. This is obviously an extremely contrived example.
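For reference, kernel A has roughly this shape — a sketch reconstructed from the description above, not the repo's exact source:

```cuda
// Illustrative sketch of kernel A (names and launch details are assumptions).
__global__ void kernel_a(double* a, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    double d = 0.0;
    #pragma unroll 100            // the loop body is unrolled 100 times
    for (int j = 0; j < m; ++j)   // m is passed in as 10,000 here
        d += a[i];                // a[i] stays in a register: ~one 8-byte load

    a[i] = d;                     // one 8-byte store
}
```

Each thread therefore does on the order of 10,000 double-precision adds against only a handful of bytes of DRAM traffic, which is why we expect it to land in the compute-bound regime.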
A
You
would
never
do
anything
like
this
in
a
real
science
application,
but
the
reason
I've
chosen
this
example
is
that
it
there's
a
lot
of
compute
work
here
right
and
so,
if
we
think
about
the
plot
that
that
sam
showed
earlier,
where
you're
in
both
the
bandwidth
band
regime
and
the
compute
band
regime,
this
should
probably
be
in
the
computer
and
regime
because
we're
doing
a
lot
of
floating
point
operations
for
a
relatively
small
amount
of
work.
So
I'm
going
to
pose
a
question
if
m
is
equal
to
10
000?
A
A
A
Okay,
yeah,
that's
that's
pretty
much
right.
So,
basically,
what
we're
doing
here
is
we
are
doing
10,
000,
double
precision,
floating
point
operations
and
approximately
speaking
we're
just
doing
a
single
load
and
a
single
store
right
now.
How
many
bytes
are
there
in
a
double
precision?
Word:
there's
eight!
That's
exactly
right,
so
you
could
estimate
the
arithmetic
intensity
of
this
kernel
approximately
as
ten
thousand
a
rate
now.
A
Somebody
else
in
the
chat
pointed
out
that
this
may
be
affected
by
the
cache
line
size
right
because
in
reality
we
are
not
just
learning
a
single
eight
byte
word
of
a
we're,
in
fact,
loading
a
full
cache
line.
Typically
now
the
reason
that
that
is
not
so
relevant
for
this
is
really
depending
on
the
way
the
innovative
gpus
work,
because
each
thread
in
our
cuda
kernel
is
is
accessing
a
different
location
in
a
so.
A
It's
also
true
that
multiple
threads
are
exiting
that
same
cache
line
at
the
same
time,
and
so
from
a
from
the
perspective
of
how
we
typically
would
analyze
this
piece
of
code.
Typically,
we
would.
We
would
use
that
definition
of
arithmetic
intensity
of
ten
thousand
over
eight
or
twelve
fifty.
But
what
we'll
see
is
that
in
fact
that
could
be
affected
by
the
cash,
but
it
turns
out.
A
Actually
the
arithmetic
is
in
fact
1250,
but
it's
it's
good
that
you're
already
kind
of
understanding
that
the
cash
effects
could
affect
that,
but
typically
what
we
do
in
roof
light
analysis.
Is
we
separate
that
out
right?
So
we
just
focus
on
the
number
of
bytes
moved
from
dram
in
order
to
do
this,
and
because
it
happens
to
be
the
case
that,
for
this
application
we
are
going
to
be
loading
every
element
of
a
and
coalesce
loads.
Kernel B is actually identical to kernel A in code — I'll explain the difference between kernel B and kernel A in a second — and then kernel C is a little bit different. Kernel C has a strided memory access: a at some location, determined by our unique thread index in the grid, is set equal to b at some other, strided index, plus b. So this is just a single double-precision add, and the strided index is given by this formula. Rather than try to parse what this math does: it means that the threads in warp zero access memory locations that are 32 bytes apart; warp one then accesses the next locations, so thread zero in warp one accesses a[1], thread one accesses a[stride + 1], and so on. The end result is that every location in a does get accessed exactly once, and the same thing is true for b, but for any particular thread the element it stores into a comes from a different offset in b, and any particular warp is accessing disjoint locations in memory. This is pretty much one of the worst access patterns you can have from the perspective of coalesced loads within a warp. The question was whether I'm assuming 32 threads per block — I'm actually using 64 threads per block in this example; I just used warps here for simplicity.
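Again as a sketch rather than the repo's exact code — the real index arithmetic is a bit more involved — kernel C looks something like this:

```cuda
// Illustrative sketch of kernel C's strided access (the exact index formula in
// the repo differs; 'stride' and this expression are assumptions).
__global__ void kernel_c(double* a, const double* b, int stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Neighboring threads in a warp end up 'stride' apart in b, so the loads
    // are not coalesced, even though every element of b is touched exactly once.
    int j = (i * stride) % n;

    a[i] = b[j] + b[j];   // a single double-precision add per element
}
```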
So I mentioned that from the perspective of coalesced loads from DRAM this is actually pretty bad. You could attempt to compute in your head — and, if you want, take a guess in the chat — what the performance of this will be: what kind of DRAM bandwidth we'll get from this kernel. Go ahead and think about that if you want to.

Now, for these three kernels we are creating arrays of length 80 × 2048 × 100. I've chosen this number because 80 × 2048 is the total number of threads that can be simultaneously resident on a single V100 GPU, and V100 is the GPU we're going to use today on the Cori GPU nodes. I have scaled that by 100 just so there's sufficient work to do, and again these are double-precision numbers. You can see I've just created, with CUDA, arrays a and b of that length.
I've just set them to zero, I'm defining my threads per block, and I'm launching kernel A, kernel B, and then kernel C. I mentioned that in code kernel A and kernel B are identical, but the difference is in how I'm launching them: with kernel B, I'm setting the dynamic shared memory for each thread block to be equal to 96 kilobytes. This happens to be the maximum amount of shared memory that you can request on a Volta V100 GPU, as long as you set the corresponding attribute for the function appropriately. Essentially what this does is ensure that only one thread block can be simultaneously resident on a particular SM, rather than the maximum, which is 32. That will have effects on the occupancy of our GPU, because GPUs work by hiding latency, by having many threads simultaneously resident at once — so when one thread issues a memory load, we can shuffle it off to the side and let another thread come in and do some work.
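The launch difference between kernel A and kernel B is roughly the following — a sketch assuming the kernel signatures from the earlier sketches and the CUDA runtime API; the repo's code may differ in detail:

```cuda
// Kernel B opts in to 96 KB of dynamic shared memory per block. On Volta,
// requesting more than 48 KB requires setting this function attribute first.
int shmemBytes = 96 * 1024;

cudaFuncSetAttribute(kernel_b,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     shmemBytes);

kernel_a<<<numBlocks, threadsPerBlock>>>(a, m);               // no shared memory
kernel_b<<<numBlocks, threadsPerBlock, shmemBytes>>>(a, m);   // 96 KB per block
```

With 96 KB per block, only one block fits in an SM's shared memory, which is what caps the occupancy as described above.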
Okay, so with that overview of the code, let's go ahead and compile it and run it. If you have the CUDA module loaded, which is just the CUDA 11.0.167 module, you can compile it with your standard nvcc command — if you've used CUDA before, this is how it looks. The .cu extension is just convention for CUDA; it doesn't have to be that way — there's a flag you can use if you want to just name it .cpp, for example.

Now let's run it under ncu. You'll see this makes it take quite a bit longer, because we're running each kernel 19 times in order to collect the requisite statistics for the information that we requested. The output then lists each kernel one by one and gives you some summary output: it gives you this GPU Speed of Light section, which summarizes how effectively you were using the GPU, and the same is true for kernel B and then kernel C.
Now, that only collects a relatively limited set of information. You can use the --set full option to collect pretty much the full set of statistics that Nsight Compute is capable of collecting, and now each kernel will have to be run 75 times, because we're collecting more data, which requires more hardware and software counters. You can see this makes the application take even longer to run, and it quickly becomes a chore if you have a real science application, so it's important to profile only the kernels that you're interested in.

Then, finally, what we can do is store the output to a file. I'm going to do ncu -o tutorial, and what that does is store the output to a particular file which will have the file name tutorial.ncu-rep — the file extension gets added automatically.
I just have to give it the name of the file. This will take about the same amount of time to do, and what you'll see at the end is that I don't get any summary output to standard out — I just get the fact that the report file was created — and if I inspect my local directory now, you'll see I have this tutorial.ncu-rep file. I can then copy that report file down to my laptop and open it in the UI there.

That is the workflow I'm going to use today, because I would prefer not to try to drive the GUI remotely from the system. As I think Charlene pointed out, it's possible to do that using either X forwarding or NoMachine if you have to, but I strongly encourage you to download the user interface onto your local laptop and try it that way — though if you didn't get a chance to do that ahead of time, you may not want to do it right now.
So again, I'm not going to do that today — in fact, I don't think I even have the right X forwarding set up. Well, I guess I do have the right X forwarding setup, but it looks like it crashed, and I haven't bothered to try to debug that. But if you can get that working, that's one way to run the user interface.

Instead, I'm just going to launch the user interface on my local system, and it looks like this. When you open it, you get this pop-up box which, if you had any recent files open, would show them. I'm just going to X out of this box, and I'm going to manually locate the file I downloaded onto my system: I'm going to go to File and then Open File — I'm not going to use Open Project.
The way Nsight Compute works is that it shows every invocation of a kernel as a separate launch in this launch page, so you can see I've launched kernel A, kernel B, and kernel C exactly once. We start with kernel A, because it happens to be the first one we launched in the application, and that's what we're going to look at first.

The first thing you see is the GPU Speed of Light section, which tells you what percentage of both peak compute and peak memory bandwidth we achieved for this particular kernel. It goes from 0 to 100, and if your bar is at 100, that means you are using 100% of your compute subsystem and you are bottlenecked by it — you can't get better than 100; that's just the limit of the machine. A similar thing is true for memory.
If you were at 100% of memory bandwidth, that would tell you you're limited by the pure hardware memory bandwidth of the system. So if I hover over this, you can see I get an SM value of 99.81%. SM stands for streaming multiprocessor — that's just the name for the fundamental compute units on the GPU — so this means I'm using the compute units of the GPU at basically 100%; I could not do any better than this from the perspective of compute utilization.

99.81 — okay. If I scroll down a little bit, you can see that I have the roofline chart; that's the next thing here. The roofline chart tells you what the arithmetic intensity is: the actual dot on the graph corresponds to the achieved arithmetic intensity, and if you hover over it you see both the arithmetic intensity and the performance in FLOPs — this is about 3.35 TFLOP/s.
The vertical axis is the performance in FLOP/s, and the horizontal axis is the arithmetic intensity. So if I hover over this, you can see that the arithmetic intensity is 632.52. How does that compare to the number we were looking at before? Well, if we refer back to our kernel, what we see is that we're doing both a load and a store. When we said 1250 before, we were only accounting for one of those two operations; the true arithmetic intensity accounts for the fact that we're loading eight bytes and then storing eight bytes, so really the number is 10,000 over 16, which is about 625. So we're getting approximately the right answer — approximately what you would expect, 10,000 / 16.
Now, this is intended to look exactly like the plots that Sam was showing before. This diagonal line here is the memory-bandwidth-bound part of the system — that's for arithmetic intensities below about 10 — and then the square is located at the machine balance point that Sam was talking about. This is exactly where memory bandwidth and compute are balanced, and it happens to be at an arithmetic intensity of about 7.5 for double precision, and about double that for single precision, which is listed as floating point here. The memory bandwidth part is the same, because memory bandwidth is memory bandwidth — bytes are bytes — but on the compute-bound side there are actually different roofs for double precision, which is this lower roof, and single precision, which is the upper roof. That reflects the fact that the compute performance of NVIDIA GPUs is not the same for single precision and double precision.
There are about twice as many single-precision floating point units on the GPU as there are double-precision units, so the peak double-precision performance is about half that of single precision.

We can see that our kernel is exactly where we'd expect: it's way over here in the compute-bound regime, has the right arithmetic intensity, and is relatively close, at least in logarithmic terms, to the double-precision roof. Now, if we look at the actual value here, we can see it's about 3.35 TFLOP/s, and if I hover over the double-precision roofline, we can see that the peak is listed as 6.7 TFLOP/s — so we got exactly half of the peak.

Well, the reason relates to what Sam was talking about earlier: when we count FLOPs, how we count depends on the operations that are occurring. This kernel does a double-precision add, which is one FLOP in the way we typically count FLOPs. The GPU can do a single double-precision add in a single clock cycle, but it can also do a single double-precision FMA (fused multiply-add) in a single clock cycle, and the number of FLOPs associated with that is different — there's a factor of two difference. That is a relevant thing to consider when you're counting FLOPs. If we were doing an instruction-based roofline like the one Sam mentioned, it would give you a different view: it's basically saying that we're limited by the instruction throughput of the double-precision pipeline, and that would be true whether we were doing double-precision adds or double-precision FMAs.
Now let's switch to kernel B. It's something like a factor of six below what we got before, and if we look at the Speed of Light section above, it's telling us basically the same information: we're getting only about a quarter of our peak performance, compared to the 100% we saw before. So in kernel A we got 100%, and in kernel B we got only about 25% of peak throughput.

I mentioned in my talk that you can use the baseline feature, so I'm going to go ahead and click Add Baseline, which makes kernel A — the high-performing one — the baseline, and then switch to kernel B. You can see the sharp disparity between the new one and the baseline: the current one is shown in blue and the baseline is now colored in green.
We can then plot these two both on the roofline chart and see that the baseline again has much higher performance than the new one, kernel B. The color of the ring here represents which of the two a point is referencing: the outer ring of this data point is green, so it corresponds to the baseline, and the outer ring of this data point is blue, so it corresponds to the current one, the one we're looking at now.

If I scroll down here and look at the occupancy section, what I see is that the theoretical occupancy of this kernel is only about three percent. That's because, as I mentioned before, NVIDIA GPUs can have as many as 32 thread blocks simultaneously resident on an SM, but we've artificially limited the number of thread blocks that can be resident on an SM by requesting 96 kilobytes of shared memory, which means only one block can be simultaneously resident on each SM. Whereas if we look at kernel A, the theoretical occupancy is 100%, because we are not using any shared memory for that kernel, so there is no limit from — no contention for — memory resources in that kernel. So essentially our theoretical occupancy went down by a factor of 32, which is a large factor in why our performance went down by this factor of four or whatever it was.
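You can also check this effect programmatically with the occupancy API — a minimal sketch, assuming the kernel_b and 64-threads-per-block configuration from the earlier sketches, and that the shared-memory attribute has already been set:

```cuda
// How many blocks fit per SM once 96 KB of dynamic shared memory is requested?
int blocksPerSM = 0;
int threadsPerBlock = 64;
size_t shmemBytes = 96 * 1024;

cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel_b,
                                              threadsPerBlock, shmemBytes);

// On V100 this returns 1 (vs. up to 32 blocks/SM with no shared memory),
// which corresponds to the ~3% theoretical occupancy Nsight Compute reports.
```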
The fact that it didn't go down by a full factor of 32 is something worth thinking about. I can give some hints about that, but I'd encourage you to think about it first. I'm going to go ahead and add one more baseline, so that now kernel B is also a baseline, and switch to kernel C. If I look at the roofline, kernel C is way over here in the bandwidth-bound part of the regime; you can see its arithmetic intensity is 0.06.

If you look at the code, this is exactly what we'd expect. How do we count flops here? Well, we're doing a single double-precision add — b plus b — so that's one double-precision floating point operation, and then we are loading b once and storing a once. So we're loading eight bytes and storing eight bytes for a total of 16 bytes moved, and we're doing a single floating point operation.
Note that the compiler will optimize this: it's not actually going to load b twice; it's going to load b once into a register and then add that register to itself to do the floating point add — so we're only loading b at this index once. Note, though, something very interesting: the point is very close to the bandwidth-bound part of the roofline, whereas I said that from the perspective of memory accesses this is one of the worst patterns you can have, because every warp is only accessing one element out of the potential 32 that it could have been loading at one time in a coalesced load. The reason you still get a pretty high effective DRAM bandwidth — and this is just the DRAM roofline here, we're not talking about L1 or L2 cache at this point — is that we're getting a lot of L2 cache utilization from this kernel. If we look at the Memory Workload Analysis, we can see that our L2 cache has a hit rate of 90%. This means the cache is very effectively saving our performance in this application, and the way that works out in practice is that, if you look at this code, if warp zero loads the cache line corresponding to this element, then when warp one comes along later, this element is already going to be in cache — and a similar thing is true for all of these other locations as well: they will all have been loaded into cache already.
Now, one thing that we have done for this application — or rather, I should say, one thing we've done for the NERSC installation — is that we have created hierarchical roofline charts that show L1 and L2 cache. So if I look at the L1 and L2 view, I now have rooflines for both the L2 and L1 caches as well; this is representative of what Sam was showing before. If I look at my DRAM value here, I can see this is the 0.06, and if I look at my L1 achieved value, for example, it's 0.02, but the L2 value is actually pretty much right on top of the DRAM value — you can't even distinguish them in this case. Sorry, it's actually right below the L1 value, I should say, rather than the DRAM. This is a pretty good example of what we were talking about before: for many applications, whether or not the L1, L2, and DRAM points are spread out really affects your interpretation of the performance of the application.
Okay, I'm going to stop talking about this toy example now. One thing I will say is that I pretty much stole these examples from a wonderful talk that I'd encourage you to listen to, which was presented at GTC 2019 last year. It's a talk given by some engineers on our developer tools team, Sanjiv Satoor and Magnus Strengert — in fact, I think I saw Magnus in the participants earlier; I don't know if he's still paying attention — but he gave a wonderful talk on these three kernels, which really helps you understand the way that multi-threaded applications work on GPUs, in particular how warps get partitioned onto SMs. So if you want a really detailed view of how these kernels play out from a performance perspective, I'd encourage you to check that talk out; I've included the link at the bottom of my slides.
So what I'm going to do now is look at this gpp.f90 code. This code is really a single kernel that we're going to look at, and one thing I'll say is that I'm not going to talk at all about the science case this represents. It actually comes from the Berkeley code called BerkeleyGW, which is a materials science application, and it does have a real science case behind it. But I'm just going to show you some code, and we're going to think about how the code works without really understanding the science behind it. The README for this repo does give links to some talks that Charlene and Sam and others have given in the past, which give more detail about this application and what the motivation for it is.

The single kernel we're going to look at is this code right here. It's a single OpenACC loop, and it's triply nested — actually there are four nested loops in there: these three loops and then a final loop here. So this is just four do loops in Fortran, and the only work this application really does is this single triply-nested loop, with this little bit in the middle that has a length of three. I've given the trip count of each loop in a comment here: this inner loop just has a trip count of three, but the trip counts of the outer loops are like a thousand each, and then ten thousand for this loop here. That's really all this code does.
If I scroll down to the end, the only thing there is a check of whether the output is correct, and a lot of the boilerplate code that initializes the data we're going to look at is all off in the companion .f90 file — if you really want to understand the data structures, that's where you would go. It's a mix of one-dimensional and two-dimensional arrays, and we'll talk about that in a little bit, but this initialize-data routine basically does the work of loading in the .dat file, which is the actual input data, and then allocating and storing the data in all the arrays. I don't want to focus on that today; I just want to, again, be a computer scientist: look at some code and understand the performance implications of that code.
What this code does is it has these three outer loops, which are loops over some elements — and again, I'm not even going to talk about what they mean physically; I'm just going to treat them as code. We have a loop over the bands, a loop over ngpown, and a loop over ncouls; the loop indices are n1_loc, igp, and ig, and then iw is the inner loop, which just has a trip count of three. It has some conditional code here and here, and then it stores its result as a sum reduction into these values, the ssx_array and sch_array. Now, in the original code this came from, these were intended to be actual arrays of length three, but OpenACC does not support array reductions, at least in the 2.7 standard we're going to work with today. So for this code we have explicitly broken the reduction up into three components manually — you have the _1 component, _2, and _3 — so we're doing a sum reduction over six variables, which are just the three components of each of the two arrays.
If you've never seen OpenACC before, that's totally okay — this works pretty much like you might expect if you've used, for example, OpenMP: the sum reduction means the same thing, the present clause just means that this data is already on the GPU, and loop, gang, and vector tell you something about how the parallelism is mapped to the GPU, which I won't focus on today. We're basically going to treat this as a fully collapsed loop, and that's mostly what we're going to focus on today: we flatten these three loops into a single loop of length, you know, a thousand times a thousand times ten thousand. So we have a fairly large amount of work to do, which is good, because GPUs are only effective if you have a relatively large number of degrees of freedom to work with — it typically needs to be like a million or more, and this satisfies that requirement.
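To make that mapping concrete, here is a very rough CUDA-style analogue of the fully collapsed baseline — the real code is an OpenACC Fortran loop nest and the compiler does this flattening for you, so the names and the decomposition below are only assumptions for illustration:

```cuda
// Rough analogue of collapse(3): one thread per (n1_loc, igp, ig) triple.
__global__ void gpp_collapse3_analogue(/* array arguments omitted */
                                       int nbands, int ngpown, int ncouls)
{
    long long idx   = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long total = (long long)nbands * ngpown * ncouls;
    if (idx >= total) return;

    // Recover the loop indices from the flattened index (ig varies fastest):
    int ig     = (int)(idx % ncouls);
    int igp    = (int)((idx / ncouls) % ngpown);
    int n1_loc = (int)(idx / ((long long)ncouls * ngpown));

    // ... the iw = 1..3 body and the six scalar sum-reductions go here ...
}
```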
One thing to note is that these kernels use double-complex, double-precision arithmetic. I'm not going to talk too much about that, but just be aware that when you see something like conjg, it's referring to the complex conjugate, which in Fortran is an intrinsic you can work with.

Okay, let me just take a brief pause — any questions on either this code, or anyone getting stuck downloading the code, or anything like that?

Okay, I'm going to keep going then. In the chat somebody asked: is there a C or C++ version of this code? The answer is yes, but we don't have it in the repo right now. I think that's one of the things we want to do in the future, to have a C version of this code — and in fact, in the actual BerkeleyGW code that this comes from, this kernel has now been converted to C++, so it shouldn't be too hard for us to create a C++ version of the code. We just haven't gotten to that yet for this tutorial repo.
Okay, so as long as you have the PGI module loaded, you'll be able to compile this code — the Makefile has the right flags for you — and what you'll get is this gpp.x executable. If I run it, it does two things: it loads the data and then runs the kernel. It reports the time it took to run the kernel — that triply-nested loop we saw — and also prints some diagnostic output that tells you whether you got the results correct or not, which is useful because if you make some change, you want to make sure you didn't get the answer wrong when you did it. Having that validation is important.
Now, to help me collect the data, I created a simple tutorial script, which is in the tutorial directory. If I look at the tutorial directory, I have this profile.sh script, and if I look at it, it's really just doing ncu with --set full and then gpp.x. I have some logistics in there which are specific to the NERSC installation and which help us get our custom hierarchical roofline analysis. Those hierarchical rooflines that I showed you — like the double-precision hierarchical roofline chart — are not shipping as a default part of the tool, but we created them as a custom report section for Nsight Compute in this tutorial repo, and so we're just kind of giving it to you. Maybe later on, in later versions of Nsight Compute, we'll look at installing these as a default set of report sections that you can collect.
So what I'm going to do is run my profile script. I have to run it through srun so it runs on the GPU, and it takes a single argument — just the name of the profile, to keep it simple. I'm going to name my profile "baseline", and that will create baseline.ncu-rep, which is the profile of the baseline version of this code. Now, unfortunately, this is going to take quite a bit of time, because remember we have to run that kernel 75 times in order to collect the statistics — so whereas it only took about 1.8 seconds to run when we were not profiling, it's going to take a minute or two to profile.

A question in the chat was: does Nsight Compute work with libraries that use CUDA-aware MPI? There are a couple of things to break down there. One is that both Nsight Systems and Nsight Compute are not really designed for large-scale parallel profiling, so you use them to profile individual MPI ranks and just create an individual report file for every MPI rank. The second part of the question was about CUDA-aware MPI; that's a subtlety I won't really get into, but generally speaking, yes — there shouldn't be any additional complication from the fact that GPU buffers are being passed around, because that's really orthogonal to Nsight Compute, which is just analyzing your kernels; it will simply ignore the MPI bits.
Now, while I wait for this to finish, I just want to say a couple more words about this kernel. We saw that we were doing a triply-collapsed loop over our three loop nests, which have meaningful work to do, and we've chosen that as the baseline code because that's basically what you would do as your naive first attempt at porting this application. You generally would follow the paradigm of: I want to expose as much parallelism as possible on the GPU. GPUs are hungry for work, and exposing as much parallelism as you can is a pretty good rule of thumb. So, generally speaking, when you port a code from CPUs to GPUs for the first time, it's a pretty good idea to just expose as much parallelism as possible, and we've taken that approach.
It's actually taking longer than I expected — I hope Cori's not hanging on me. So if we take a look at this code, we might think about the question: is there anything we can do differently? One thing to note here is that the trip counts of the outer loops are both about a thousand, and the trip count of this other loop is ten thousand. If we look at the loop body, we see that there's some work to do, but when we look at the profile we may have questions about how it performs. The first question to answer, really, is: is this a memory-bandwidth-bound code or a compute-bound code? I think that if you looked at this code and stared at it long enough, you still would not be able to figure that out — this is a relatively complex bit of code. It takes the absolute value of a complex number, which is not a trivial thing, and then it has these reduction operations, so it's relatively hard to tell just by inspecting the code whether it's bandwidth bound or compute bound. So the first thing we're going to do is look at our profile and try to understand that. Sorry — this is taking much longer than I expected to collect this profile.
So I got this baseline.ncu-rep file here, and I'm going to do the same process of copying it down to my local system: I'm going to scp it from Cori, from the same location in roofline-on-nvidia-gpus, but this one is called baseline.ncu-rep. All right, I'm going to open Nsight Compute, clear all my baselines, close out the old tutorial report file, and open my new file, which is called baseline.ncu-rep.

The roofline tells us that we're right on the cusp between bandwidth bound and compute bound. There's also a second point on this curve — you can see this is a single-precision point. It turns out that the compiler is generating some single-precision instructions even though there aren't any explicitly in the code, but the performance of that is completely irrelevant: almost all of the work is happening in the double-precision part. What we'd like is to move into the compute-bound part of the chart and be confident that, if we did make enough compute optimizations, we could hopefully get up to this roofline. So the goal is to get up to the roofline, and the way we can do that is by giving ourselves some room to breathe — by being in the compute-bound part of the regime, i.e., by increasing the arithmetic intensity.
It's always the dream of an HPC programmer to be in the compute-bound part of the regime, because then you have a chance of using the full, advertised, roughly seven-teraflop sticker performance of the GPU. It's not always easy to get there — in fact, it's often very hard for many HPC codes — but our goal, or the logical first step, might be: let's try to move over to the compute-bound part of the regime.
What I'm going to do is take note of what I was talking about earlier: I can choose, artificially, to only collapse two of the loops rather than three. If I do that, it means one of the three loops will be executed serially by every thread — so now I'm injecting a thousand or ten thousand iterations' worth of work, depending on which of the three loops I choose to run sequentially in each thread. I could do something like this, and what it would do is enforce that these two loops are parallelized among threads, while both of the two inner loops are then run sequentially by each thread, so there's more work per thread. Whether this brings us to the compute-bound regime or not really depends on the balance between memory operations and compute operations inside the kernel, but generally speaking, giving each thread more work gives us a pretty reasonable chance of increasing the arithmetic intensity of that thread.
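In the same rough CUDA-analogue terms as before — again an assumption for illustration, not the repo's OpenACC/Fortran source — the restructured mapping looks like this:

```cuda
// Rough analogue of collapsing only the (igp, ig) loops: one thread per
// (igp, ig) pair, with n1_loc (and the short iw loop) run serially per thread.
__global__ void gpp_collapse2_analogue(/* array arguments omitted */
                                       int nbands, int ngpown, int ncouls)
{
    long long idx = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (idx >= (long long)ngpown * ncouls) return;

    int ig  = (int)(idx % ncouls);   // fastest-varying index stays coalesced
    int igp = (int)(idx / ncouls);

    for (int n1_loc = 0; n1_loc < nbands; ++n1_loc) {  // now sequential per thread
        // ... iw = 1..3 body and the six scalar sum-reductions, as before ...
    }
}
```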
So let's look at which of the three loops to do that to, by looking at the memory access patterns of each of the arrays in this kernel. There are 2D arrays and 1D arrays — most of them are 2D arrays. If you look at these arrays — the wtilde_array, the I_eps_array, here's aqsmtemp, aqsntemp — what you see is that most commonly ig is the first array index, and in Fortran the first index is the fastest-moving one, because Fortran is column major. By the rules of good performance on NVIDIA GPUs, we generally want sequential threads to be accessing sequential locations in that fastest-moving index. So because ig occurs very commonly as the first array index, it's typically going to be best for performance if ig is mapped across sequential threads. Similarly, igp is the first index for at least one of the arrays here, and then n1_loc, which is our third loop index, is always the outermost index of any of the arrays we're accessing. So a pretty good guess for performance is that we generally want to ensure that array accesses indexed by ig and igp are coalesced as much as possible; as for n1_loc,
that's the least important, because n1_loc is always the strided index for almost all of these arrays, meaning that sequential threads could never access sequential locations via n1_loc — it's always the outermost index, so sequential values of n1_loc are never contiguous in memory. So that's what I'm going to choose to do, and it makes sense: when we collapse the loops, igp and ig get flattened out, but the flattening the compiler does is sane — it's what you'd expect — so when we flatten it out, sequential values of ig still map to sequential threads, and for all of the arrays that have ig as the first index you keep coalesced accesses. Hugo points out that n1_loc is the coalesced index for one of these arrays, and that's true — but what we're really looking at is the overall balance of array accesses in this kernel. We see that for most of the arrays — all of the multi-dimensional arrays — n1_loc is the outer index, and we can hope, or guess, or at least experiment with the idea, that this will offset that one case.
So it's worth an experiment. I'm going to go ahead and compile it now with this change, run it, and see if that helps.

What you can see is that this really did not change the performance at all: it was about 1.8 seconds before and it's about 1.8 seconds now. So that's interesting. By the way, if you didn't follow what I did in the code, I have a set of git patches in the tutorial directory; those patches are basically the automated way to apply what I just did. In fact, if you were to do a git checkout and then git apply the step-one patch, you would see that it makes that same change to my kernel — exactly the change I just made, where I made igp and ig the outer, collapsed loops and made the n1_loc loop sequential.
Finally, the next thing we can do is profile this code to understand: even though the code didn't get faster, did it achieve the thing we wanted it to achieve, namely making the arithmetic intensity of this kernel increase? That was the goal to begin with — it wasn't necessarily to get faster, it was just to give us room to breathe so that we could then apply some optimizations. So I'm going to go ahead and profile it now. Again, this will take a minute or two to collect the data; that's just an inevitable consequence of how long it takes to collect this, so we'll just be patient and wait, and I can take any questions people have while we're waiting.
A
A
Right, so the idea, from the question in chat, is that we're trying to do two things with this change. First, we're trying to increase the arithmetic intensity; that's the actual thing we're trying to achieve. We're trying to give each thread more work to do, which is one way of hoping we can get higher arithmetic intensity, because we might have more flops per byte moved.
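For reference, this is the quantity being targeted, stated in standard roofline terms rather than anything specific to this kernel:

```latex
% Arithmetic intensity: floating-point work per byte of data moved to or from memory.
\[
  \mathrm{AI} \;=\; \frac{\text{floating-point operations performed}}{\text{bytes moved}}
  \qquad \text{[FLOP/byte]}
\]
```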
A
Given that we've chosen to do this, we're then trying to choose which of these three loops to do that operation on, and we're choosing n1_loc as the least harmful loop to do it on, because in all the two-dimensional arrays that we looked at, n1_loc was the outermost index, so it does not correspond to sequential locations in memory. We're hoping, as an experiment, that that will offset the other cases, like that occ array, where it is sequential in memory.
A
So there is no silver bullet here; there's nothing we can do that would unilaterally make the performance of the kernel better without any trade-off. There's always a trade-off, but our guess is that n1_loc is the least harmful loop to make the sequential innermost loop, because of those array access patterns.
A
Well, I don't know for sure; I'm just hypothesizing that the reason n1_loc is the one we want to use, and not ig, is that for most of the two-dimensional arrays ig is the fastest-moving index, and the general rule of thumb for NVIDIA GPUs, or really any GPUs, is that sequential threads should be accessing sequential locations in memory, and we're parallelizing these loops over threads.

A
And so we want to ensure that sequential threads, which correspond to sequential indices in these loops, are accessing sequential locations in these arrays. n1_loc is most commonly the one that is not coalesced, that is not contiguous in memory in its accesses to these arrays, so it's probably the one we can run sequentially within a thread rather than parallelize among threads.
A
Laptop, or to my desktop. While I do that, because of how long it takes to profile, I'm actually going to go ahead and apply the next change now, and then I'll let it collect that data, and as it's doing that I will explain what this change was and why we're doing it. And I need to not do that, I need step two. Okay.
A
So while I collect step two, and I'll explain what step two is in a moment, I'm going to go ahead and look at step one. Let's go ahead and do File, Open, and then look for the step one .ncu-rep report. I'm going to first go to my baseline code and click Add Baseline, so we can see what our performance comparison is, and then look at what the new code looks like; the orange is the baseline.
A
So my hypothesis worked out, and in fact I was able to increase the arithmetic intensities by a factor of about three, which is great, because now I can just focus on optimizing these parts of the compute workload and hope that when I do that I move upward and can get closer to that roofline.

A
Now, I kind of presented it to you as if I knew what I was doing, but in reality this is an experimental process. This is something that both the BerkeleyGW folks and I have looked at for quite a lot of time, to figure out which is the right set of steps to do.
A
But we didn't know that a priori, so we kind of had to experiment. You could do this kind of analysis that I was doing just by looking at the code, if you're experienced enough at GPU programming, but sometimes you might just want to experiment and try different things and then see how they affect the arithmetic intensity.
A
And if you had done one of these other experiments that I mentioned, you might have seen that it went a different way, and maybe that would be an indication that it was the wrong direction to go in. Now, if you look at the actual utilization, you see that it in fact decreased a little bit: my SM and memory bandwidth percentages are actually a little lower than the baseline in both cases.
A
But that's okay, because this is not going to be our final step; we're just trying to get the arithmetic intensity increased. So now we can focus on optimizing these parts of the kernel and then go from there. This is just giving us some breathing room for the optimizations that are going to follow.
A
If I can find my... good, okay. So this is the baseline code, and what we did was collapse two loops instead of three and move this n1_loc loop to the middle; that's the one change we made. The next change we're going to make focuses on these inner bits of the loop.
A
What we do is move this iw loop outside the OpenACC region, and the rest looks like what we had before, where you have igp and ig as a two-dimensional collapsed loop and then n1_loc is our innermost loop, which is sequential. The nice thing about this code now is that we can reduce over two values instead of six, ssx and sch, which are just single complex double-precision values, and then this branchiness here just gets replaced by a single set of values, the ssx and sch arrays.

A
And then, after each of these three iterations of the parallel loop, we just add their respective values to the locations in the actual arrays that we want to do the reduction over. So this is kind of a hack reflecting the fact that OpenACC doesn't have array reductions, but in fact we might have wanted to do it anyway, to reduce the branchiness of the code and reduce the amount of code that is related to reductions, so we can just focus on doing as much compute as possible.
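A minimal sketch of the shape of that workaround, with made-up names and placeholder arithmetic rather than the real GPP variables: the short iw loop stays outside the OpenACC region, each pass reduces into plain scalars, and those scalars are then accumulated into the output arrays.

```fortran
! Hypothetical sketch of the step-2 restructuring; every name here is a placeholder.
! OpenACC has no array reductions, so each iw pass reduces into scalar accumulators.
subroutine reduce_per_iw(term1, term2, out1, out2, ng, ngp, n1_loc, nw)
  implicit none
  integer, intent(in)       :: ng, ngp, n1_loc, nw
  complex(8), intent(in)    :: term1(ng, ngp, n1_loc), term2(ng, ngp, n1_loc)
  complex(8), intent(inout) :: out1(nw), out2(nw)
  complex(8) :: ssx, sch
  integer :: iw, ig, igp, n1

  do iw = 1, nw                          ! small outer loop, kept outside the ACC region
     ssx = (0.0d0, 0.0d0)
     sch = (0.0d0, 0.0d0)
     !$acc parallel loop collapse(2) reduction(+:ssx, sch)
     do igp = 1, ngp
        do ig = 1, ng
           !$acc loop seq
           do n1 = 1, n1_loc
              ssx = ssx + term1(ig, igp, n1)   ! placeholder for the real per-element work
              sch = sch + term2(ig, igp, n1)
           end do
        end do
     end do
     ! The write-back after the parallel loop replaces what would have been an array reduction.
     out1(iw) = out1(iw) + ssx
     out2(iw) = out2(iw) + sch
  end do
end subroutine reduce_per_iw
```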
A
So now what you see is a couple of things. One is that if you look at the point, it's actually a little bit higher, so we've achieved somewhat higher performance. It's hard to see on this scale, but it is higher vertically. If you hover over the point, you can see that this one is 2.5 teraflops, whereas the other was 2.0 teraflops, so that was roughly a 25 percent increase in performance. That's definitely non-trivial.
A
Another thing to notice is that our arithmetic intensity actually decreased again: we had an arithmetic intensity of 20 before, and now we have an arithmetic intensity of 10. So this is interesting. Our goal was to just move this point vertically upward, and we did move it vertically upward, but we also moved it to the left, and I think that this is really an inevitable consequence of doing roofline analysis in real applications.
A
It is very hard to just move the point vertically upward, because real code doesn't work that way. Real code does not bend to our wishes and simply follow a neat set of trends; GPUs are complicated, compilers are complicated, and so it's definitely possible to move the performance upward, but not necessarily without changing the arithmetic intensity.
A
So what we've seen is that we increased the performance and we are still in the compute-bound part of the regime, and this is one reason why it was important to give ourselves that breathing room. The fact that we moved way over to the right, into the compute-bound part of the regime, meant that we had room to make a change which in some sense decreases the number of flops occurring in the loop, because we've removed some of the work.
A
We made a streamlined, simpler kernel, but we also made it a more efficient kernel. And so if we then go and look at our utilization, we now see a story where we have a much higher SM compute utilization than the baseline code.

A
And this is pretty nice, because what it's telling us is that even though we have a little bit less work to do, we're getting more efficient use of the compute units on the GPU, and that correlates with the fact that our total performance went up: we decreased from about 1.8 seconds to 1.4 seconds.
A
Now, if you look at the time, notice that it's actually a little bit different, because we're launching three kernels now; this corresponds to the fact that we are now launching this kernel multiple times, so the time for an individual kernel is different, but the overall runtime per kernel is more than a third lower, and so we've more than compensated for that.
A
And yes, definitely one of the things we want to do in a future version of Nsight Compute is make it easier to either make this a linear axis, or zoom in, or something like that, because it is in fact hard to see that difference. So that is a noted pain point, and we definitely hope to improve it in future versions of Nsight Compute. A useful add-on point to that is that we definitely want your feedback on this; this is a new feature in Nsight Compute 2020.1 with CUDA 11.

A
This is not the final version of the tool; we're definitely going to improve it based on user feedback. We've already gotten some great feedback from NERSC before, and we hope to get more feedback from the users on this call. So that's definitely one of the things we want to get out of today: for you to go ahead and try this out on your own code and give us feedback.
A
I'm not going to go through them in detail; I'm going to leave them up to you if you want to do them later on, and then I'll close with some final, some parting thoughts. So if we look at our code now, from the end of step two, there are a couple of different things we can do here to improve the performance of this code.
A
One of them is that the double-precision divides are challenging because, as Sam mentioned earlier in his talk, a division operation does not map to a single hardware instruction. Division operations are actually a sequence of instructions which implement some algorithm to do a floating-point division, and so in double precision, or in single precision but especially double precision, a division on NVIDIA GPUs is not necessarily an efficient operation; it does not map to a single hardware instruction.
A
However, there is a hardware instruction for computing the reciprocal of a double-precision number, and so in many codes that use floating-point math it is beneficial to compute the reciprocal of a number first, so we can compute some temporary variable like this. That's one thing you could do.
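For a plain real number, the kind of temporary being described might look like this minimal sketch (illustrative names, not the tutorial code); note that it can change the result at the level of floating-point round-off, which is exactly the caveat raised below.

```fortran
! Hypothetical sketch: pay for one reciprocal instead of several divides.
subroutine use_reciprocal(num1, num2, den, out1, out2)
  implicit none
  real(8), intent(in)  :: num1, num2, den
  real(8), intent(out) :: out1, out2
  real(8) :: rden

  rden = 1.0d0 / den        ! the one expensive divide/reciprocal
  out1 = num1 * rden        ! cheap multiplies replace the remaining divides
  out2 = num2 * rden
end subroutine use_reciprocal
```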
A
That's what you would be doing for a simple real floating-point number; it's a little bit different for complex numbers, but it follows the same principle: we compute the reciprocal of a number first and then multiply by it. Now, somebody asked a very logical question in the chat: why doesn't the compiler do this optimization for you?
A
Well, the answer is that sometimes it can, but one reason it may not is that this will change the result, at least at the level of the round-off or truncation error of your floating-point precision, and compilers don't always make optimizations which may change the answers at that precision. It will definitely depend on the optimization level of your compiler.
A
So that is one thing to consider when you're writing code: in many compilers, on many architectures, doing the reciprocal and then multiplying by the reciprocal is a faster operation than doing a floating-point division. This is not specific to NVIDIA GPUs.
A
There are many architectures where that's true. The other thing you could do as an optimization to this kernel is to look at some of these complex math operations and find ways to do them that are less compute-intensive. For example, the absolute value of a complex number is not just taking the sign bit and making the number positive;

A
it's actually a more involved operation to get the absolute magnitude, and so you could change the amount of work and make it more efficient by only looking at the squared magnitude of this I_eps array value and comparing that instead, taking away this abs. If we take away the abs here and the abs here, we can still do the same comparison if we want to, but with less work.
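A minimal sketch of that kind of rewrite, with made-up names rather than the actual comparison in the kernel: when two magnitudes are only being compared, comparing the squared magnitudes gives the same answer and skips the square root hidden inside abs() for complex numbers.

```fortran
! Hypothetical sketch: compare squared magnitudes instead of calling abs() on complex values.
logical function larger_magnitude(z1, z2)
  implicit none
  complex(8), intent(in) :: z1, z2
  ! abs(z1) > abs(z2)  is equivalent to  |z1|**2 > |z2|**2, since both sides are non-negative.
  larger_magnitude = (real(z1)**2 + aimag(z1)**2) > (real(z2)**2 + aimag(z2)**2)
end function larger_magnitude
```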
A
And so if you look in the tutorial readme, step three and step four are the ones I'm talking about, and they basically describe what I want you to do for these complex math and division operations. Those are some things you could look at, and I have provided for you a step three and a step four patch, which actually describe what I'm doing, in case you get lost on that operation.
A
Okay, I'm just about running out of time. The last thing I want to say before I close is that you can customize Nsight Compute to do your own roofline analysis. If you look at our ncu-sections directory, we have actually created for you some custom section files that do hierarchical roofline analysis, and so, for example, if we look at the hierarchical double-precision roofline chart section, this is an actual section file that we created; it's just a simple text file in JSON format.
A
I won't go through in detail, because of time constraints, how the format of this file works, but if you were to read through this text file you could get a sense of what it's doing, and then change which metrics you're collecting and create your own roofline analysis. When we were showing those double-precision roofline charts before, those are actually things that we created on our own and then just added to our installation.
A
With Nsight Compute you can add your own section files, and so one thing you can try is creating your own custom section file, and if you find that it's really useful you can send it to us as feedback and we can consider adding it to a new version of the tool in the future, or you can get, for example, your local HPC center to install it in their installation of Nsight Compute. In fact, if you look at the installation of Nsight Compute at NERSC (I'm not in the right window), and you look at the CUDA 11 installation of Nsight Compute, which is here, and you look at the sections directory,
A
you can see that NERSC has in fact installed there the section files that we created for this tutorial. So you can take advantage of those directly in this installation if you want to; otherwise you could just download them on your own, copy the section file in there, and then use it just the way that I've shown in my profile script.
A
Okay, so that was my roughly one-and-a-half-hour introduction to Nsight Compute roofline analysis on both the tutorial example and the GPP example. Later today you can either apply Nsight Compute to your own code, or you can do those steps three and four that I've shown off in the GPP exercise, if you want to dive deeper. Any questions before we break for lunch?
B

A
Well, nvcc is not the compiler we're using for this. I think it is true that nvcc can do that optimization; I don't remember off the top of my head how it works for double precision in particular. We made double-precision divides much more efficient in CUDA 11 compared to CUDA 10, which is what we're using now, so this operation may in fact be a lot less necessary in CUDA 11. I haven't checked that yet.
B
So
it's
a
great
tutorial
thanks
max,
I
guess
we'll
break
for
lunch
and
be
back
at
a
quarter.
Quarter
past
quarter
past
one
so
feel
free
to
post
your
questions,
or
you
know
issues
on
slack
or
on
the
google
doc
I'll,
be
monitoring
all
those
places
and
I'll
see
you
guys
in
about
an
hour.