From YouTube: Roofline Hackathon 2020, parts 1 and 2
Description
What is the Roofline performance model, and what is the mechanism behind its data collection on NVIDIA GPUs?
Right, so this will provide a brief introduction to the roofline model. It's not designed to be an exhaustive survey of all the roofline research that's happened over the last twelve or thirteen years, but we'll give you a basic introduction to what the model is and, in general, how one might apply it in the abstract. So, after a few acknowledgments, let's start out with a motivating question. Let's say you just spent the last six months porting your application to GPUs. The question becomes: are you done? Was it worth it?
Did you actually make good use of your resources? To answer that, you need to get at the question of what good performance is. That is, if you're getting good performance on the GPU, you're probably done and you should move on to other activities in order to further your research. So let's imagine that you took your application, profiled the mix of loop nests within it running on the GPU, and, for some arbitrary ordering of loop nests...
...you get these seemingly random, different flop rates for different loops. Some of them perform at very high flop rates; some of them get very, very low flop rates. That means the flop rate alone is not particularly insightful. It didn't really tell you whether you are getting good performance, because some of them are fast and some are slow. Second, you could think about just taking your existing code, running it on a Xeon or an AMD EPYC, and seeing what kind of baseline performance you get.
You could then use that baseline performance, compare it to the GPU performance, and conclude whether you're getting good performance. The problem is that that really just tells you relative speedup. That is, some kernels got enormous speedups, dramatically, orders of magnitude faster on the GPU than on the CPU, but there are other kernels, like this one, for which the speedup may have been actually very, very modest, only a slight increase in performance.
The second aspect of good performance is that you want to be making good use of the GPU's compute and/or bandwidth capabilities. That is, a GPU has tens of teraflops of compute performance and ballpark a terabyte per second of bandwidth. You want to make sure that you're using one or both of those to their fullest extent.
Ultimately, what we really need is a quantitative model rather than these qualitative or relative statements. That is, we don't want to just say "okay, it's kind of good"; we want to be able to say we're getting 90% of our theoretical limit. And we don't want to just say we're twice as fast as a Xeon; we want to be saying we're attaining 80% of our GPU's compute capability or its bandwidth capability.
Ultimately, the way the model is constructed, it's basically independent of the instruction set architecture: it doesn't matter whether it's a RISC or CISC architecture, and it doesn't matter if it's x86 or POWER. It's also independent of the implementation of the underlying architecture. This means it's applicable to a CPU or a GPU, or even a TPU.
So let's imagine, just to begin with, that we are running on some kind of superscalar Xeon; in this particular case we have a Skylake Xeon. If you just look at the single-core architectural diagram, it's incredibly complex: the number of stages, the number of operations implied in the complexity of this.
This architecture makes it very hard for any individual to contemplate how it will respond to different code. So one option would be to build a simulator of this Skylake-like CPU and then run our code through that simulator to try to make a prediction of what the performance would be. But that doesn't really give us insight as to what the bottlenecks in performance are; it only really tells us how performance responds...
...to slight changes in architecture, so it doesn't really give us that high-level intuition. Worse, simulation is incredibly slow; it's going to be orders of magnitude slower than simply running the code itself. What we really want are performance models that are orders of magnitude faster than just running the code by itself. So what we want to do is take this incredibly complex view of a processor architecture and simplify it down into a very, very simple model of what these cores look like.
So we might take this kind of high-level view and make the assumption that the individual cores in this machine can attain peak flops if they operate on local data; that is, if the data is local in cache, you always get peak flops. You might assume that all the cores are load balanced, running a single-program, multiple-data code. That means you don't have any kind of Amdahl effects and you don't have any kind of load-imbalance effects.
All the cores are doing the same thing, and thus you can collapse them down into one aggregate compute capability. You might make assumptions like there being sufficient cache bandwidth and cache capacity that cache capacity misses aren't really affecting performance. The only real effects on performance you have are how fast you can do compute and how fast you can move data on and off chip. This kind of high-level model is really the basis for what we would call the DRAM roofline model.
So in this case, all we're really thinking about is compute and data movement to and from DRAM. In that vein, the model is basically premised on answering the question: which is going to take longer? Does it take longer to move data on and off chip, or does it take longer to do the computation once the data is actually on chip? So we can write a simple equation that says: here is what we expect for the runtime.
It should be the maximum, assuming perfect overlap, of the number of floating-point operations in our loop nest divided by the peak flop rate of our machine, and the number of bytes that have to be moved on and off the chip divided by the peak bandwidth of the machine. As I mentioned, this assumes perfect overlap of these two; if you don't have perfect overlap, then you have to sum the two terms. But for the basis of the roofline model in this example, we will always assume perfect overlap of communication and computation.
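Written out, the runtime bound just described is:

$$ T \;=\; \max\!\left(\frac{\#\,\mathrm{FLOPs}}{\mathrm{Peak\ GFLOP/s}},\; \frac{\#\,\mathrm{Bytes}}{\mathrm{Peak\ GB/s}}\right) $$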
Now, to transform this into what's nominally a roofline model, we need to think about rates. If we take the original equation and just divide both sides by the number of floating-point operations in our code, we can transform the equation, and if we reciprocate it one more time, we actually get a slightly different equation. That is, the flop rate is going to be the minimum of either the peak gigaflops of the machine or the product of what we call arithmetic intensity and peak bandwidth. Now, this equation is the core equation for roofline.
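In equation form, that core equation is:

$$ \mathrm{GFLOP/s} \;=\; \min\!\left(\mathrm{Peak\ GFLOP/s},\;\; \mathrm{AI} \times \mathrm{Peak\ GB/s}\right), \qquad \mathrm{AI} = \frac{\#\,\mathrm{FLOPs}}{\#\,\mathrm{Bytes}} $$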
It's the most basic equation in roofline, and it is also the most universal. But buried in this equation is an incredibly important term that I alluded to: this arithmetic intensity. It is the ratio of the number of floating-point operations to the number of bytes in your loop nest. So, in essence, arithmetic intensity is a measure of data locality...
...of how much data reuse each of your loop nests actually has. That is, it is the ratio of the total number of floating-point operations performed in that loop nest divided by the total number of bytes moved on and off the chip for that loop nest. For the DRAM roofline model, this is total bytes to and from DRAM, and that means it includes all of the cache effects, all the prefetcher effects, any kind of speculation effects: any data that moves on and off chip has to be included in the denominator of arithmetic intensity.
That means it can be very different from the total number of loads and stores in your loop nest. A load or a store is just a request to the memory subsystem to bring data into a register, but the cache hierarchy is right there to filter all of those loads and stores and distill them down to only the compulsory set that actually has to go to DRAM.
One other way of viewing this, rather than as the total number of flops divided by the total number of bytes, is as the ratio of sustained flop rate to sustained bandwidth. In that case, time will cancel out in both terms, so you can view it in either form. For most cases, we will view it as the ratio of the number of floating-point operations divided by the number of bytes, and construct performance-instrumentation technologies geared to measure those two terms.
So let's think about how we go about visualizing this. If we take this basic equation, the minimum of the peak flop rate and the product of AI and peak bandwidth, we can plot it as a roofline bound using arithmetic intensity as the x-axis. For a number of historical reasons, we always plot it on a log-log scale. For one, this makes it incredibly easy to doodle on a whiteboard, to brainstorm, to think about how you might have orders of magnitude different bandwidths, or, alternately...
...how Moore's Law has allowed you to have orders of magnitude faster CPUs and GPUs over the years. That is, those orders of magnitude become linear steps, and thus data does not get squashed into the origin; you see it well separated. So in this case, the vertical axis is the attainable flop rate for our loop nest and the x-axis is the arithmetic intensity. One of the terms in this equation is the peak flop rate of the machine; the other term is the product of arithmetic intensity and peak bandwidth.
The model itself imposes a minimum function on these two, and thus we end up being constrained to say that performance must be on or below this line. Now, there's one very important facet in this figure: the transition point where you actually move from being limited by memory bandwidth to being limited by the peak flop rate of the machine. That transition point is the machine balance. That's also an incredibly important term in the roofline model: the machine balance is the ratio of the peak flop rate of the machine divided by the peak bandwidth.
So, in essence, this provides a dual form to what we see with arithmetic intensity: whereas arithmetic intensity characterizes applications, machine balance characterizes architecture. In both cases they are a ratio of flops to bytes. For applications, it's total flops performed divided by total bytes moved; for machine balance, it's the peak flop rate of your machine divided by the peak bandwidth of your machine.
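Side by side, the dual pair is:

$$ \mathrm{AI} = \frac{\text{total FLOPs performed}}{\text{total bytes moved}} \ \ \text{(application)}, \qquad B_m = \frac{\mathrm{Peak\ GFLOP/s}}{\mathrm{Peak\ GB/s}} \ \ \text{(machine)} $$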
So, in essence, the roofline model will tessellate this two-dimensional space of flops and AI into five separate regions, and those five regions are important to think about. First of all, we have the region above the dotted pink line. This is unattainable performance: it is faster than the speed of light. You can never actually have an application run in this regime, because it would mean you are executing at compute rates faster than what the machine is capable of doing.
So we can just say we can never have a dot up in this region. Second, we have the region where we are less than the machine balance, less than this vertical line, but also greater than the machine's bandwidth line. That's also unattainable performance. Basically, it says that you don't have the bandwidth to actually operate your GPU in this regime; that is, for the amount of data locality you have, you have sufficient compute capability, but you don't have sufficient bandwidth. The third regime is the bandwidth-bound regime.
Here we have a low arithmetic intensity, that is, our arithmetic intensity is less than the machine balance, but we are also operating at less than the memory bandwidth of the machine, maybe within 50 percent of the machine's peak bandwidth. We would describe that as the bandwidth-bound regime, and you're actually getting pretty good performance: you're getting 50 percent of what the roofline tells you you can get. It's not a hundred percent, but you are somewhat constrained by the memory bandwidth of the machine. The fourth regime is the compute-limited regime.
In this case, we are to the right of the machine balance. That is, we have lots of data reuse, very high data locality, but we're not actually getting peak flops; we're getting somewhere between 50 and 100 percent of the peak flop rate of the machine. And then, finally, we have the regime where we are actually below 50 percent of bandwidth and below 50 percent of the compute capability of the machine. We might describe this as poor performance.
So let's consider an example. A typical machine balance today is somewhere between five and ten floating-point operations per byte. Now, remember, that's floating-point operations per byte; if you want to convert that to floating-point operations per double-precision word, you need to multiply by eight. That means you have to do somewhere between 40 and 80 floating-point operations per double-precision word in order to guarantee you are compute limited. That's where the machine transitions from being bandwidth limited to being compute limited.
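As a quick worked conversion of those numbers:

$$ B_m \approx 5\text{--}10\ \tfrac{\mathrm{FLOPs}}{\mathrm{byte}} \times 8\ \tfrac{\mathrm{bytes}}{\mathrm{FP64\ word}} = 40\text{--}80\ \tfrac{\mathrm{FLOPs}}{\mathrm{FP64\ word}} $$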
Fundamentally, that's an artifact of technology and money and the way applications are driven today, and it's very, very unlikely that's going to dramatically improve in the future. That is, in the future, if you improve bandwidth, you're more than likely going to improve compute by even more. So we can mark this transition point at five flops per byte, this machine balance. We can then consider a very, very simple vector-vector operation: in this case, we're going to take two vectors, x and y, scale y by a constant alpha...
...add it to x, and store the result into a third vector, z. If we think about this code, we're going to do two floating-point operations per iteration, that is, an addition and a multiplication on every iteration, and we transfer 24 bytes to and from DRAM. That is, we have to read y (8 bytes), read x (8 bytes), and write z (8 bytes); 8 plus 8 plus 8 is 24.
So if we take the arithmetic intensity of this loop nest, the two flops divided by the 24 bytes, we get 0.083 flops per byte. That means that for vector-vector operations, this kind of BLAS-1-like operation, we are going to be extremely memory limited. We are far, far below the machine balance, which means that fundamentally these operations are memory-bandwidth limited and they will perform at a very, very low flop rate.
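As a minimal sketch of the vector-vector operation just described (a NumPy version; the array length n is an arbitrary placeholder):

```python
import numpy as np

n = 1 << 20
alpha = 2.0
x = np.random.rand(n)
y = np.random.rand(n)

# z[i] = x[i] + alpha * y[i]: one add and one multiply per iteration
z = x + alpha * y

flops = 2 * n               # 1 add + 1 mul per element
bytes_moved = 3 * 8 * n     # read x, read y, write z (8 bytes each)
ai = flops / bytes_moved    # 2 / 24 = 0.083 flops per byte
print(f"AI = {ai:.3f} flops/byte")
```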
Now let's consider a more complicated example that has some degree of reuse. In this case, we took some kind of Laplacian operator and did a second-order discretization of it, producing a 7-point, constant-coefficient stencil. We're going to read from seven points in memory, basically a star-shaped stencil, and write to a new grid in memory.
So if we look at this, we're going to do seven floating-point operations, that is, the six additions and the one multiplication by a constant, and we do eight memory references: we read those seven points and we write this other point. So one might think that your AI is 0.11, that is, you take those seven flops and divide by 64 bytes.
Well, that gives us an AI, but the problem is that it's not the right AI. That arithmetic intensity is really the arithmetic intensity measured at the L1 level. The thing to remember is that for the DRAM roofline, we always want to measure the data moving on and off chip, and the observation here is that a perfect cache hierarchy will filter out all but one read and one write for this loop nest.
That means that in reality only one of these memory references will actually miss in the cache; the other memory references will all hit in the cache and thus not incur DRAM data movement. So if we actually do the calculation, we get a different arithmetic intensity: we get the seven flops divided by...
...sorry, the seven flops divided by sixteen bytes, which gives us 0.44 flops per byte. That's the ideal arithmetic intensity for this kind of seven-point stencil. So if we think back: where was triad, where was our vector-vector operation? Well, that's way down here at 0.083. If we do a seven-point stencil, we're going to have five times the arithmetic intensity, but remember, 0.44 is still far less than the five flops per byte, which means that we're still heavily memory-bandwidth bound.
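A minimal sketch of the stencil and the two intensities just discussed (pure Python/NumPy; the grid size and coefficient are illustrative placeholders):

```python
import numpy as np

n = 32                      # illustrative grid size
u = np.random.rand(n, n, n)
v = np.zeros_like(u)
c = 0.1                     # the constant coefficient

# 7-point constant-coefficient stencil: 7 reads + 1 write per point,
# 6 additions and 1 multiplication
for k in range(1, n - 1):
    for j in range(1, n - 1):
        for i in range(1, n - 1):
            v[k, j, i] = c * (u[k, j, i]
                              + u[k, j, i - 1] + u[k, j, i + 1]
                              + u[k, j - 1, i] + u[k, j + 1, i]
                              + u[k - 1, j, i] + u[k + 1, j, i])

ai_l1    = 7 / (8 * 8)   # all 8 references counted: ~0.11 flops/byte
ai_ideal = 7 / (2 * 8)   # perfect caching, 1 read + 1 write: ~0.44 flops/byte
```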
So let's think back to the original motivating question: what is good performance? We have this random assortment of loop nests from our exercise of porting our application to a GPU, and initially it looks like there's no real rhyme or reason to the actual performance. However, if we were to sort our kernels based on their arithmetic intensity and plot them accordingly on the x-axis, we can then compare the performance of our individual loop nests to the associated roofline model.
However, there are a few kernels that are actually outside that region. We can observe, first of all, that we can have kernels with low performance. This kernel down here has very, very low performance, that is, the y-coordinate of this kernel, its flop rate, is very low, but we can actually say it's making good use of the GPU, because it's getting a very high fraction of the memory bandwidth of the GPU.
So we can then focus our performance-optimization efforts on trying to address these red kernels, and bypass spending a bunch of time trying to optimize these green kernels, because we can only get slight increases in performance for them. So, as a recap for this first section: the roofline model is made of two components. You have the machine model itself, which is all the lines in the roofline plot.
Those are your bandwidth lines, your 50%-of-peak ceilings, and your peak flop rate lines, which by definition define the machine balance, the transition point where you go from being bandwidth limited to compute limited. Those lines are going to be unique to each architecture: if you run on a CPU, you'll get one set of lines; if you run on a Volta, you'll get a different set of lines.
The other aspect of the roofline model is application characteristics; in this case, these are all the dots. The dots are basically defined by the number of flops that an application performs and the number of bytes that it moves. That is, the x-coordinate is the ratio of those two, and the y-coordinate is the ratio of flops and runtime. This means each dot is unique to a loop nest.
So let's think about what the general performance strategy is for using roofline. First of all, if you're far below the peak flop rate of the machine, that is, you're in the compute-limited regime like this red dot but far less than 50% of peak, your real goal is to try to improve the performance of that individual loop nest. That's kind of obvious, right? You want those loop nests to actually run faster. But there is a subtlety when you're actually in the bandwidth-limited regime.
Being bandwidth limited doesn't quite mean that you're completely done. The way you actually improve performance in the bandwidth-limited regime is to increase AI, and to increase AI, the arithmetic intensity, you have to decrease the denominator; that is, you decrease the data movement. You decrease data movement by improving spatial locality, by improving cache blocking, or by choosing alternate data structures or alternate data types that require less data. In any of those veins, you may actually reduce the data movement for that loop nest and thereby raise its arithmetic intensity.
I kind of alluded to this early on, but you end up with the question: how can performance ever be below the roofline? How do you ever end up in that regime where your dot is not just right smack on the roofline performance bound? Well, there are a number of different ways. First of all, we can have dots that are actually misplaced. That is, for the kernel itself, when you were doing your instrumentation activity, you may have calculated the wrong number of floating-point operations.
You may have calculated the wrong number of bytes. This can occur for a number of reasons: you can have broken hardware or software performance counters, you can make wrong assumptions about how many flops are actually being performed by an operation, or the way you calculate how many bytes are being moved may be off.
Second, the lines may be misplaced. That is, you may look at an architectural manual and say: okay, here's the peak bandwidth of the machine and here's the peak flop rate of the machine. The problem is that that may be an overestimate; it may be the ideal assumption of what the roofline should look like on that target architecture. The way you really construct a roofline model is with an empirical approach: you actually have to benchmark the memory bandwidth of the machine and the peak flop rate of the machine.
If you fail to do that, depending on the architecture, you could be off by as much as 20%. Those assumptions also presume that you're perfectly load balanced: if all the cores, or all the SMs, are driving the memory subsystem at the same time, you get one bandwidth; if only one SM is driving the memory subsystem and the other 79 on a Volta are completely idle, you will never get peak bandwidth.
The third way you can be below the roofline is that there could be missing lines; that is, there could be bounds other than flops and bytes. The original equation we wrote was based on an incredibly simplified model, where we distilled that incredibly complicated architecture down into just compute and data movement, but the reality is that we can back off on a few of those assumptions.
So let's think about a few of those cases and how to actually rectify them. First of all, let's think about the model or application instrumentation issues causing us to be below the roofline. As I mentioned, the theoretical performance specifications you may get can be highly optimistic: the DRAM pin bandwidth, that is, the number of bits times the frequency, versus the sustained bandwidth could be quite different.
On modern architectures, whether they be CPUs or GPUs, you may fall into a turbo mode, where you actually run at a higher frequency for a short burst of time, or you may be underclocked because you're thermally limited. In either case, that can affect your overall compute capability. And then there's the more subtle aspect of what happens when you have a really, really complicated loop nest and the compiler just gives up. You may say this should never happen, it should never happen, but the reality is it does.
There are times where the compiler just balks at overly complicated code and generates poor-quality code. So what we really need is an empirical approach to performance data. That is, we want to actually benchmark our target machines so that we characterize what the peak flop rate of the machine is and what the peak bandwidth of the machine is. By the same extension, we want an empirical approach to application characterization.
That is, we want to know how many flops we performed and how many bytes we moved, not on a theoretical basis but on an empirical, observational basis. To answer the first question, several years ago LBL developed what was called the Empirical Roofline Toolkit. This is a way we characterize a CPU- or GPU-accelerated machine: it gives us the peak flop rates of the machine and the bandwidth at each level of the memory hierarchy.
It was written with MPI plus OpenMP and CUDA, which allows us to run on multiple GPUs on a multi-GPU accelerated node architecture. So we could run it on the Cori KNL machine and get one set of data: we get a DRAM roofline, but we can also use the same tool to construct an L2 roofline or an L1 roofline, knowing what the bandwidth of the target machine is. By the same extension, we could actually run it on a SummitDev system; this was a few years ago.
So let's get now to the next question of theoretical versus empirical, and think about how we visualize this. We may have the theoretical model: the quoted flop rate of a GPU or CPU and the quoted bandwidth of a GPU or CPU. If we actually go ahead and run ERT, we are almost invariably going to get a lower bandwidth, and we will almost invariably get a lower flop rate. That means that our dot, even though it hasn't actually moved, is now closer to the nominal roofline limit.
Second, we can think about how we actually go about measuring the number of flops in our code. Our code might have things like divide instructions. Well, most instruction set architectures don't actually incorporate a divide instruction, but map a divide into a sequence of floating-point instructions. That means the total number of instructions you're executing is higher, and thus your empirical flop rate is higher than what you might have calculated by simply looking at your loop nest and counting flops.
Next, we can think about what happens when we include all the cache effects, all the data-movement effects. That is, when we go to actually measure how much data we move, we don't want to just look at our loop nest and calculate flops or bytes; we want to think about how many bytes were actually moved to and from the memory subsystem. In some cases, due to cache effects...
...this may actually be quite high, and we might thus see a decrease in arithmetic intensity, which may actually put us very, very close to the nominal roofline. So, just as a recap: using the Empirical Roofline Toolkit or some other benchmarking technique lowers the model, bringing the roofline lines themselves closer to the application's characteristics. Similarly, measuring the application's actual data movement and actual flops gives us a true sense of how close it is to the real machine capabilities.
The next aspect of why we might be below the roofline is centered around the cache hierarchy. That is, we may have bottlenecks in the cache that are actually more constraining. If we think about our memory hierarchy, in this case on a CPU, we might have registers in the CPU core itself, and then an L1, an L2, an L3, and DRAM, where we have locality at each of these levels. This means that we have an associated bandwidth at each of these levels.
We also have an associated machine balance at each of these levels; that is, for each level we have the peak flops of the machine divided by the peak bandwidth at that particular level. By corollary, we also have an associated data movement for each of our applications, for each of our loop nests, at each of these levels. That is, a given loop nest will have a unique number of L1 bytes, L2 bytes, L3 bytes, and DRAM bytes.
That means that a given loop nest also has a unique arithmetic intensity for each level of the memory hierarchy: for your loop nest you have an L1 intensity, an L2 intensity, an L3 intensity, and a DRAM intensity. So we can think about how we might extend our nominal roofline model. If we think back to our original equation, we might define AI with a subscript based on which level it is, and we could then add an additional term per level, so this basically bounds our attainable performance by the minimum over all levels.
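In equation form, the hierarchical extension just described is:

$$ \mathrm{GFLOP/s} \;=\; \min\!\left(\mathrm{Peak\ GFLOP/s},\;\; \min_{L\,\in\,\{\mathrm{L1},\,\mathrm{L2},\,\mathrm{L3},\,\mathrm{DRAM}\}} \mathrm{AI}_L \times \mathrm{Peak\ BW}_L\right) $$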
How do we go about visualizing this? Well, we could think about having 15 different bounds on 15 different figures, but it's actually much, much easier to plot them on a single figure. So in this particular case, we have what is called a hierarchical roofline model.
We start out with our original roofline, which has the HBM bound and the peak flop bound, but we can also add an additional bound based on the L2 cache bandwidth, associated with the L2 intensity for our given application. So for our application, for our loop nest: remember, these two dots are exactly the same loop nest; it just happens that this one uses the AI for the L2 and this one uses the AI for DRAM.
The thing is that we can never have two different performance numbers for a given loop nest. That means that what we will actually observe for a given loop nest is that the dots will always have the same y-coordinate, but they will have different x-coordinates: an x-coordinate for your L2 intensity and an x-coordinate for your HBM intensity. In this particular case, because we are bound by L2 bandwidth, we will see that the DRAM performance is well below the DRAM bandwidth bound of the associated machine.
We could also imagine a similar case where we have reversed things and we actually have much higher L2 locality. Now, there are a few things to observe when we use the hierarchical roofline model. Look at the x-coordinates of your loop nest's AIs: if the L2 AI is very, very different from the DRAM AI, that says that you actually have very, very high reuse in the L2 cache. So in this particular example, we are moving orders of magnitude more bytes to and from the L2 than we do from DRAM.
That says that we're getting really, really good cache locality in the L2, and only a few bytes actually have to trickle out all the way to DRAM. Conversely, we could imagine running a different loop nest where we actually have no reuse; that is, every time we move a byte to and from the L2, we end up moving a byte to and from DRAM. That basically says that the L2 is doing nothing for us.
It's not doing any kind of bandwidth filtering, it's not doing any latency filtering; all we're doing is streaming data through the L2. So when the AIs are widely separated, we have high reuse; when they are very, very close together, we have no reuse. Having no reuse is not necessarily a good thing, because it says that you're not really making good use of the inherent cache architecture that's in every CPU and GPU. You really want to be in that scenario where those AIs are widely separated.
The third aspect of why we might be below the roofline centers around in-core effects. This is really geared towards the instruction set: are we using fused multiply-add, vectorization, tensor cores? Vectors by themselves have their limits: applications have a finite amount of data-level parallelism, and when you use a vector machine, the register-file energy basically scales with the vector length. There are a number of other constraints that say vectors eventually taper out in terms of their performance. The death of Moore's Law is really reinvigorating...
...some facets of complex instruction set computing. You're not going to get back to the kind of complicated load architectures where you're mixing loads and compute; I think load-store architectures are here to stay. But what you will get are very, very complicated compute instructions. This started out with fused multiply-add instructions, where you have a single instruction that takes its operands, multiplies two of them, and adds the product to a third, storing the result. That can be extended, obviously, into a vector version.
You can then go from that version into what is called quad FMA, which appeared in x86 instruction sets. These are basically matrix-vector multiplications in a single instruction. And then on GPUs you have tensor core instructions, where you might have a multiplication of two small matrices added to a third matrix. In all of these cases, a single instruction, or a limited number of instructions, does a large number of operations. But this means that the instructions are now going to be...
...a mix: the instructions in an application are really a mix of scalar instructions (which could be predicated on a vector machine), vector instructions, and matrix operations, and that means that performance is now going to be a weighted average of all these different types of instructions. A scalar instruction might only do one floating-point operation, a vector instruction might do 32 operations, and a tensor instruction might do 128 floating-point operations. You have to add all of those up to understand whether you're getting good performance.
So if we consider something like a Volta GPU, we have ballpark 100 teraflops of FP16 tensor performance, we have something like only 15 teraflops of FP32 performance, and if we get rid of FMA, we only have something like seven and a half teraflops of FP32 add performance. Any kind of deep learning application will be a mix of tensor operations, FP16 operations, and FP32 operations.
That means that your deep learning performance may be well below the nominal tensor core peak, because it's having to average together instructions that are FP32 adds, FP32 FMAs, and FP16 WMMAs. In essence, the mix of the actual instructions imposes an effective ceiling on performance, and the real question then becomes: how close are you to that effective ceiling?
The fourth aspect is FPU starvation. That is, we have assumed to date that it's just a question of how fast we can feed instructions to the FPU, and that that's going to be our limiting factor, modulo data locality. But the reality is that processors have finite instruction decode and issue bandwidth, which means that the number of floating-point units dictates the floating-point instruction rate required to actually hit that peak performance number.
The ratio of those two is the fraction of floating-point instructions required to actually hit that peak performance number. So let's consider an example. Let's say we have some four-issue superscalar CPU with two floating-point datapaths. That means at least 50% of our instructions have to be floating-point to have any chance of getting peak performance. If we have only 25% floating-point, and, say, 75% integer, our performance can never exceed 50% of peak, and it falls progressively from there.
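One way to write down the bound sketched in this example (my formulation, not one stated verbatim in the talk): with issue width $W$, $D$ floating-point datapaths, and a fraction $f$ of the instruction mix being floating-point,

$$ \frac{\text{attainable FLOP/s}}{\text{peak FLOP/s}} \;\le\; \min\!\left(1,\; \frac{f \cdot W}{D}\right) $$

For $W = 4$, $D = 2$, $f = 0.25$, this gives 50% of peak, matching the example.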
So if we have applications that are dominated by integer instructions, we have to really take this into account, because we are not going to be compute limited for those classes of applications. In the worst case, we might have an architecture that has two floating-point datapaths but is only two-way superscalar. In that particular case, you might need a hundred percent of your instructions to be floating-point to have any chance of getting peak performance, which is basically never going to happen.
If you're in that regime and only 25% of your instructions are floating-point, you're going to get a very, very low fraction of peak performance, even if you are well past the machine balance. So this gives rise to a different version of roofline: how do we think about roofline geared not around floating-point operations but around instructions? This is the instruction roofline model; I have the reference here to the paper that we did last year. So how do we go beyond the flop-centered roofline?
We can think about how we might classify applications. We have the heavy floating-point applications; that's actually kind of rare within DOE. We have applications that are a mix of integer and floating-point operations; that's more common. But then we have these emerging classes of applications from bioinformatics or graph algorithms, where they may be integer-only computations, that is, they have no floating-point operations. If you have no floating-point operations, your arithmetic intensity is zero, and you can never even plot a zero arithmetic intensity on a log-log roofline model.
The other aspect is a different way of dealing with mixed-precision codes: rather than thinking about how you do a weighted average of flops, you think about how many instructions you're executing. I will note that one way Intel Advisor dealt with this is that they went from just doing floating-point operations per second to integer operations, or flops plus integer operations per second, which is useful when you want to understand performance as operations per second, rather than bottlenecks in the machine as instructions per second.
So what we really wanted at that point was an instruction roofline model, not an integer-operation roofline model. The most basic way of doing this, on a SIMD machine, is to consider vector micro-ops instead of flops; vector micro-ops can be easily mapped to any kind of vector unit utilization.
The other advantage is that when we deal with CPUs, most of our performance counters don't give us full flop counts, but they actually give us vector micro-ops, which makes it an easier transition to constructing a roofline model. The thing to keep in mind is that full utilization of your vector unit does not imply full peak performance, because peak performance assumes that you did FMA, you did vector operations, you did tensor operations. Vector unit utilization just says the vector units are busy all the time; they could be busy doing inefficient instructions.
So in this particular case, we might start out with the traditional roofline model, which has bandwidth and flops; we have a nominal arithmetic intensity associated with it and a performance well below that number. We can think about moving to a vector micro-op version: we might have the same bandwidth, but now we have peak vector micro-ops per second rather than peak flops per second. This is basically taking how many operations are in an instruction and dividing by it, which means we have a potentially different machine balance.
We have a potentially different AI associated with the number of instructions, or the number of micro-ops, that we're actually executing. When we look at that version of roofline, we may actually be getting a very, very high fraction of the micro-op roofline, rather than of the nominal flop roofline.
So the question then becomes: how do we take this kind of formulation and apply it to an NVIDIA GPU? Well, we might not have vector micro-ops; we probably have warp instructions instead. But then the question becomes: do we want to do instructions per byte, or something else? This gets into the question of what an instruction is on the GPU. If you do the more thread-centric version, then you hide some of the issue limits; if you do the more warp-centric version, then you hide some of the predication effects.
The solution was basically to scale the number of non-predicated threads by the warp size, i.e. divide by 32, and show it in terms of warp instructions per second. You can, of course, then break these down into subclasses, just integer, FP32, load/store, whatever, to understand bottlenecks in individual functional units rather than bottlenecks at the warp issue rate. Now, naively one might think you ought to use bytes, and that would match the existing roofline quite well when thinking about intensity.
That is, if we did instructions per byte, that's our direct translation from our original flops per byte. But the reality is that GPUs access memory using transactions, and those transactions might be 32 bytes for global or local memory and might be 128 bytes for shared memory. So we ended up deciding to use instructions per transaction as the means of understanding both machine balance and application intensity. This preserves the traditional concepts of the roofline model, but it actually ended up allowing us to think of new ways of understanding memory access.
So this means that we start out with our original flop-centric roofline. If we have integer-heavy codes, we want to transform this: we think of it as giga-instructions per second, with some kind of instruction intensity rather than arithmetic intensity. And then we can modify that to be warp instructions, and think about how we would map this if we actually dealt with transactions instead of bytes. This means that, for the instruction roofline model, we have the peak instruction rate of the machine.
We can then basically plot this roofline for a Volta GPU, and we get these numbers; it's just a different way of trying to analyze application performance. But what it really allowed us to do is think about global memory access differently. Rather than thinking of total instruction intensity, total instructions divided by total transactions, if we think specifically about load/store instructions divided by global transactions, we get a very special meaning in this particular case: this allows us to understand the efficiency of global memory access.
We can actually observe that there are three very important intensities of load/store instructions per global transaction, basically mapping to what our memory access pattern is like. If we're doing fully random access, where every thread in a warp is accessing a random location in memory, we're basically doing the same thing as if we were striding by greater than 128 bytes; that's the minimum intensity we can ever have for load/store intensity.
Conversely, if all the threads in a warp access exactly the same memory location, then only a single transaction is required, and thus we can have a very, very high intensity of one instruction per transaction. Somewhere in between, we have the unit-stride memory access pattern, where our warp just walks through memory sequentially and the threads in a warp access 32 consecutive memory elements.
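As a back-of-the-envelope sketch of those three intensities (assuming 32-byte global transactions and 8-byte words, per the discussion above):

```python
threads_per_warp = 32
txn_bytes = 32      # global memory transaction size
word_bytes = 8      # double-precision element

# Transactions generated by one warp-level load/store instruction:
txn_random    = threads_per_warp                            # every thread its own transaction
txn_stride1   = threads_per_warp * word_bytes // txn_bytes  # 256 B / 32 B = 8 transactions
txn_broadcast = 1                                           # all threads hit one location

for name, txn in [("random", txn_random), ("unit-stride", txn_stride1),
                  ("broadcast", txn_broadcast)]:
    print(f"{name:12s}: {1 / txn:.4f} instructions/transaction")
```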
We have the unattainable low to the left and the unattainable high to the right, and if we actually plot our application's intensity, we may see that our application is actually accessing global memory inefficiently; that is, out of the box, our application may be accessing memory close to a random-access pattern. Through some optimization efforts, we really want to think about how we can transform that intensity to improve it, to get it close to the unit-stride intensity.
We can do the same exercise in shared memory and think about how that same kind of concept, shared load/store instructions divided by shared transactions, allows us to quantify the number of bank conflicts that are actually occurring. That is, if all the threads in a warp access the same bank in shared memory, you're going to get a 32-way bank conflict and you're going to generate 32 transactions. That's really low performance.
Conversely, if all the threads in a warp access a different bank in shared memory, you'll only generate the one transaction and you'll get a high shared load/store intensity. In the same way, we can think about plotting our application's intensity and about how optimization may improve that intensity. So if we look at a Smith-Waterman-type example, we may observe that in the naive implementation we actually get kind of moderate instruction throughput.
If we think about what the actual global load/store efficiency is, we see that it's actually rather poor; that is, we're doing almost random access in the naive implementation, while, conversely, we may have no bank conflicts. The optimized implementation may do memory coalescing: this allows us to move from a strided memory access to a coalesced, unit-stride memory access and thereby get better performance. Once again, the number of bank conflicts didn't really change.
For the instruction roofline: the traditional roofline is really about telling us about performance. The use of FMA or SIMD vectors really has no effect on intensity, but it can increase performance. What the instruction roofline does is tell us about bottlenecks, whether those bottlenecks are in the issue rate or in memory. Any use of FMA or vectors actually decreases instruction intensity and may actually decrease instruction performance, while any kind of integer instructions may actually increase instruction intensity and increase instruction throughput.
One of the other ways you can be below the roofline is if you are underutilizing the parallelism of the machine. If we think about running a traditional thread-scaling experiment on a CPU, we may, for different problem sizes, scale up the number of threads and observe the differences in flop rates. Remember, this is a log-log scale, so we can actually see where the blue problem size saturates in performance while the green problem size actually falls over in performance.
The problem is that this kind of formulation, this way of thinking about thread scalability, doesn't really tell us anything about what went wrong. Why did the green problem size actually see a turnover in performance, and see lower performance as we increased the number of threads? So one of the things Khaled Ibrahim did was to take roofline and use it to understand process or thread scalability.
Basically, you're doing a 2D scatter plot with a trendline function to understand how performance and arithmetic intensity change with thread concurrency. So whereas the blue line in this case may actually see substantial increases in performance through every different concurrency, between 1, 2, 4, 8, 16, and 64 threads, what we actually observe is that it's losing arithmetic intensity; that is, the arithmetic intensity is starting to wane. Conversely, the green and red problem sizes see ideal scaling for a range.
This can also be applied to other NAS benchmarks. It can be used to understand the difference between OpenACC and CUDA, and I will point people to this paper from Bench last year, which actually won a best paper award, for understanding the differences between these different programming models on different NAS parallel benchmarks. So, to provide a recap: what roofline is really doing is bounding performance as a function of arithmetic intensity. Roofline itself has those horizontal lines, which are the compute ceilings, and it has the diagonal lines, which are the bandwidth ceilings.
You have arithmetic intensity, which is going to be unique for each loop nest. It is unique for each level of memory, and it is that measure of data locality, the measure of temporal locality. It is the ratio of the total number of flops that your loop nest performs divided by the total bytes your loop nest actually moves. When we plot on the roofline, every loop has one dot per level of the memory hierarchy.
So if you have ten major loops and four levels of memory hierarchy, you have 40 dots that you might have to plot. More than likely, you'll only plot a subset of those at a time: you might plot only the DRAM dots, or you might plot all four levels for a single loop nest. That cuts down on how much data you're actually having to visualize. When one of those dots is close to the ceiling, that indicates you are likely seeing a performance bound.
The position of those dots relative to each other within a loop nest is indicative of the cache locality. Remember, for a given loop nest, if your four dots for L1, L2, L3, and DRAM are widely separated, that means you're getting great cache locality; if they're all bunched together, that basically means you're streaming data through the cache. All of these concepts apply equally to any kind of GPU or other accelerator.
So what do we use roofline for? Well, there's the obvious thing of using it to understand the differences between architectures, programming models, and implementations. That is, why do some architectures or implementations perform better than others? Why do some compilers perform better than others? But it's also useful for understanding and predicting performance on future machines. That is, it allows us to set realistic performance expectations and focus on where we actually need to drive future architectures; in some cases, we want more bandwidth.
It's also, of course, useful for understanding performance bottlenecks and for trying to motivate software optimization. But, finally, it's really good for determining when we're done optimizing code. When you are close to that roofline limit, you really need to think about how you make algorithmic changes to move forward, because you're really not going to make substantial increases in performance when you're already within 90 percent of the roofline limit.
At the same time, you can imagine taking your performance today, your performance a month from now, and your performance three months from now, and plotting it all on the same roofline figure; you can see a resultant trajectory and see how you're actually approaching the roofline limit. I will say that the model itself is just one piece of the puzzle. It defines the basic concepts and the basic equations, but at the same time you have to have system characterization, which really defines where the lines are...
...and you have the application characterization to define the dots, and then you have some kind of visualization and analysis tool. In the remainder of this tutorial, Charlene will demonstrate how to construct the roofline model on NVIDIA GPUs, focusing really on system characterization and application characterization. Max will demonstrate how to use Nsight Compute to automate the roofline collection; this includes the GPU benchmarking, the application characterization, and integrated visualization. And then you will go ahead and use Nsight Compute to analyze your own individual applications.
Nsight will probably always implement a subset of what has been done. Remember, there are the kind of research activities which are the bleeding edge: they go ahead and think about new ideas, think about new concepts. Some of those pan out, some of them may not pan out; some of them are broadly applicable, some of them are more niche. Nsight itself will most likely take a subset of them and incorporate them. Okay.
Sure, I was going to answer one more question that I think was in the chat window. You can always construct an operation-centered roofline, which takes together both integer operations and floating-point operations, or you can do that in mixed precision. Nominally, though, when we think about instruction roofline versus flop-centered roofline, those tend to be separate concepts: we can think about floating-point instructions per second or total instructions per second, or we can think about floating-point operations per second.
So after the theory talk, I would like to go through the practical mechanism as to how to collect and refine the data, and today we're really just focused on NVIDIA GPUs. Like Sam says, the general methodology of the roofline model works for all architectures; you just need to find the proper metrics and the proper tools to collect the relevant data.
The goal here is to plot a roofline like this. You probably have multiple memory levels on the architecture, different data precisions, and different instruction types you may be executing; you may be using CUDA cores, or you may be using tensor cores as well. But essentially we want to have a very complex roofline like this.
And if those kernels cannot reach, say, the peaks that we see from the white paper, then we really have to consider what the actual runtime environment is; maybe the power is being constrained, or some other things are constraining the peak. But by doing this, we do get a more realistic understanding of what the peaks can be, because if those micro-kernels cannot reach the advertised peaks, then we cannot expect that large-scale HPC applications will do so.
That's the whole purpose of this Empirical Roofline Toolkit; I'll talk a bit more about it in a moment. The first step is to get the ceilings, and then we need to measure the application data to put the dots on the roofline. Those dots have two coordinates. The x-coordinate is the arithmetic intensity, and to calculate that we need to measure the flops, which is the total number of floating-point operations carried out in the kernel, and then the data movement.
So, how many bytes have been moved for a particular memory level? The ratio of these two is the arithmetic intensity. The y-coordinate is the performance, so flops per second, and for that we need to get the runtime for the kernel. So there are basically three quantities we have to measure. And I guess here I have to say that for data movement there's a number of bytes for each different cache level; so, if you're looking at a hierarchical roofline, then you need to measure more than three quantities.
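A minimal sketch of turning those measured quantities into roofline coordinates (the names are illustrative):

```python
def roofline_dot(flops, bytes_moved, time_s):
    """Return (x, y) = (arithmetic intensity, performance) for one kernel."""
    ai = flops / bytes_moved        # flops/byte at the chosen memory level
    gflops = flops / time_s / 1e9   # sustained GFLOP/s
    return ai, gflops

# For a hierarchical roofline, reuse the same flops and runtime with the
# bytes measured at each level (L1, L2, HBM, ...): same y, different x.
```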
But the method of calculating the arithmetic intensity and the performance is the same. So after getting all these numbers, we need to plot them, and the more automatic way of doing that is to use Nsight Compute. We have a few section files you can use to plot the roofline, but the section files can collect these quantities as well, so the whole workflow is automatic.
I'll give you more details about those in a moment. So, for the first step, we can use the theoretical numbers, or we can use the Empirical Roofline Toolkit, which gives you a more realistic set of peaks. This plot here shows how ERT works: it basically sweeps through a range of data sets and measures the bandwidth for each working set, and also the flops. Depending on how compute-intensive the micro-kernel you're using is, you could be getting the peak bandwidth or the peak flops.
So then, how do we collect the application data? The manual way of doing this is to use Nsight Compute to collect these metrics yourself. I have listed the metrics here. The scripts we have in this repository use exactly the same metrics as well, so you can also integrate those into your own workflow, and this should produce exactly the same results as Nsight Compute.
I think previously we also published a set of metrics for CUDA 10; those are slightly different from what we have now, and if you have access to CUDA 11, we would really recommend you try the new metrics. But these metrics should be equivalent to each other.
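For reference, a hedged sketch of the kind of per-kernel post-processing those metrics feed into. The metric names follow the Nsight Compute naming scheme used by roofline tutorials of this kind, and the numeric values are placeholders, not measurements:

```python
# Hypothetical values as they might be read from an ncu CSV export.
metrics = {
    "sm__cycles_elapsed.avg": 1.2e7,                  # kernel duration in SM cycles
    "sm__cycles_elapsed.avg.per_second": 1.3e9,       # SM clock rate
    "sm__sass_thread_inst_executed_op_dadd_pred_on.sum": 1.0e9,  # FP64 adds
    "sm__sass_thread_inst_executed_op_dmul_pred_on.sum": 0.0,    # FP64 muls
    "sm__sass_thread_inst_executed_op_dfma_pred_on.sum": 2.0e9,  # FP64 FMAs
    "dram__bytes.sum": 4.8e9,                         # DRAM data movement
}

time_s = metrics["sm__cycles_elapsed.avg"] / metrics["sm__cycles_elapsed.avg.per_second"]
flops = (metrics["sm__sass_thread_inst_executed_op_dadd_pred_on.sum"]
         + metrics["sm__sass_thread_inst_executed_op_dmul_pred_on.sum"]
         + 2 * metrics["sm__sass_thread_inst_executed_op_dfma_pred_on.sum"])  # FMA = 2 flops
ai = flops / metrics["dram__bytes.sum"]   # DRAM arithmetic intensity
gflops = flops / time_s / 1e9             # sustained performance
```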
A
All
right
so
coming
to
the
last
part,
which
is
you
know
to
actually
plot
and
the
replying
charts
you
can
use
as
I
compute
you,
which
automatically
gets
a
brief
line
charge
like
this
one
thing
I
didn't
noticed
at
the
beginning,
is
that
you
know
the
the
roof
line.
Charge
is
one
per
current
one,
one
child
per
kernel.
So
if
you
don't
see,
you
know
the
relevant
kernel
only
charged,
you
may
want
to
go
to
this
drop
down
a
button
to
see
you
know
what
are
the
kernels?
You
have
profiles.
A
Of course you can just profile the kernel you wanted; it's just something you may not know at first glance. If you use the scripts here, we have kind of put all the dots on the same chart. For example, here you see different colors for different kernels and then different markers for different cache levels. But these scripts are really for the example we have in the repository, which is the GPP kernel.
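A minimal sketch of that style of log-log roofline plot in matplotlib, in the spirit of the repository's plotting scripts; the peak numbers and the sample dot are illustrative placeholders, not measured values:

```python
import numpy as np
import matplotlib.pyplot as plt

peak_gflops = 7000.0   # compute ceiling, GFLOP/s (placeholder)
peak_bw = 900.0        # HBM bandwidth ceiling, GB/s (placeholder)

ai = np.logspace(-2, 2, 200)                  # flops/byte
roof = np.minimum(peak_gflops, ai * peak_bw)  # min(compute, AI * BW)

plt.loglog(ai, roof, label="HBM roofline")
plt.loglog([0.44], [350.0], "o", label="sample kernel (placeholder)")
plt.xlabel("Arithmetic intensity (FLOPs/byte)")
plt.ylabel("Performance (GFLOP/s)")
plt.legend()
plt.savefig("roofline.png")
```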
To quickly show a few examples of the hierarchical roofline charts we have if you use Nsight Compute: this is a very typical GEMM example using tensor cores. We have five different kernels in this code, and using Nsight Compute you can see all the kernels that have been profiled. I believe these are just two different invocations of the same kernel, but you can see that we have three different dots representing the performance for different cache levels: L1, L2, and HBM.
This one is based on the kernel count; this one is based on the flops performed in that kernel. All these plots have all the dots below the single-precision peak, which is about 14 to 15 teraflops, and when you have a different setting for the code, we can have tensor core kernels as well.
So these scripts can really be customized to satisfy all your plotting needs; you just need to do a little work. The scripts we have in the repository are really the most basic ones, and of course you can employ them over the whole optimization path. So, at step one we have the performance here, and at step two, and as we optimize the kernel further, we see performance going up and up, with the arithmetic intensity also changing between different steps.
The slides will be ready; I have posted mine and Sam's, and Max will have some too, so all of them should be available after the event. And then I see three other questions. How to embed your workflows into the scripts: I think we'll get into more details about this, but I can answer quickly.
We have this GPP code, and this is the input file for the GPP code, and we have two job scripts. One is using Nsight Compute; using this you can collect the profiles, which can be opened using the Nsight Compute UI. The other script, the "customized" run script, is using the metrics I mentioned, and it is the more customized way of collecting metrics using the Nsight Compute command line; you won't get a visual profile. And to embed your workload into these scripts...
The SPEC ones? I believe so. And are you trying to profile these kernels, these benchmarks?
Yeah, I will be trying some of them, so I wonder if we have... I think they required some sort of license.
Any other questions? Max, do you see anything on the Slack channel? Is anyone asking me something?
Okay, then I guess we're due for a break. We'll be back in 15 minutes, and Max will be talking about the examples he just mentioned. All right, see you guys in a bit.