From YouTube: 06 - Performance Tools
Description
Part of the NERSC New User Training on September 28, 2022.
Please see https://www.nersc.gov/users/training/events/new-user-training-sept2022/ for the training day agenda and presentation slides.
Well, good afternoon-ish, everyone. I'll be giving a talk about performance tools, in particular the ones that we are going to look at today. Just to briefly remind everyone, as you might have already heard throughout the talks, Cori is going to be decommissioned shortly, and so we want all our users, new and old, to start working on Perlmutter as soon as possible and to start migrating to Perlmutter. Today we are going to look at the performance tools that are available on Perlmutter. Note that there are two major omissions in this list, namely Intel Advisor and Intel VTune; we no longer have those because the architecture on Perlmutter is different from Cori's. The entire list of all the available performance tools can be found at the links in the docs.
Primarily, we have seen users get interested in using perftools, which is provided by Cray; NVIDIA's Nsight Systems and Nsight Compute; and Arm MAP and Arm Performance Reports, which come from the same Arm suite as the DDT debugger. To measure I/O performance we have Darshan, an I/O profiler, and there is Timemory, a very high-fidelity profiling toolkit. It is not a profiling tool;
it is a profiling toolkit which can be leveraged to measure whatever performance metrics users are interested in. For today, though, we are going to look at a brief primer on how to use the CrayPat perftools, perftools-lite, and Nsight Systems and Nsight Compute. Note that the Cray perftools can be used with the compiler wrappers: for the supported compilers, you can use the MPI wrappers for MPI codes, you have the Cray compilers, and Fortran is supported through the ftn wrapper. The Nsight tools are used exclusively for GPU profiling, and they support these compilers as well as Python.
To start off the talk, we'll take a look at the CrayPat profiling tools; in particular, we'll be looking at perftools. Do note that CrayPat is specifically for use on Cray machines, and the results that are generated are mostly text-based. However, there is Apprentice2, which can be used as a GUI to look at the results.
The module that is a prerequisite for perftools is perftools-base, which must be loaded before loading perftools or perftools-lite; on Perlmutter it is loaded by default.
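As a quick check, and assuming the standard module commands on Perlmutter, you can confirm this before loading anything else:

    # perftools-base should already be loaded by default
    module list 2>&1 | grep perftools-base

    # If it is missing for some reason, load it before perftools or perftools-lite
    module load perftools-base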
perftools is the full suite, while perftools-lite, as the name suggests, is easy to use and is meant for quick analysis; many times it is adequate. So let's take a brief look at how we use perftools-lite. For this presentation we are using a Jacobi solver written in Fortran with MPI and OpenMP support.
One of the requirements of using perftools is that the code must be run in scratch, or you must set an environment variable to run it from elsewhere. Second, the object files generated while compiling the code should be created in a separate step and must remain present, because that is how the analysis can give you plots and charts to visualize the result. As I mentioned, you can use app2, which is Apprentice2, to view those; we recommend that you use NX (NoMachine) or, if you are in a pinch and in a hurry, launch a terminal with X11 forwarding. To profile your code with perftools-lite, these are the steps. The first step is to unload darshan and xalt, as they conflict with the metric collection.
You then load perftools-lite. Following that, we use the programming environment that contains the compilers we are looking at, and we compile the Jacobi solver in two steps: in the first step we generate the object files, and in the second step we use those object files to generate the executable.
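Put together, the steps look roughly like the sketch below; the OpenMP flag and the source and executable names are placeholders rather than the exact ones used in the demo:

    # Remove modules that conflict with CrayPat metric collection
    module unload darshan xalt

    # Load the lightweight instrumentation; perftools-base is already loaded by default
    module load perftools-lite

    # Step 1: generate the object files with the Fortran compiler wrapper
    # (use the OpenMP flag appropriate for your programming environment)
    ftn -fopenmp -c jacobi_mpiomp.f90

    # Step 2: link the object files; perftools-lite instruments the resulting executable
    ftn -fopenmp -o jacobi_mpiomp jacobi_mpiomp.o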
When you compile this code, two executables are generated. One has the name you gave it and already contains the profiling instrumentation that perftools needs to work. The other executable is called jacobi_mpiomp+orig, where "+orig" stands for the original compile; it does not generate any perftools-lite information when you run it.
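Running the instrumented binary is then an ordinary job launch from scratch; the rank and thread counts here are only illustrative:

    # perftools-lite expects to run from the scratch file system
    cd $SCRATCH/jacobi

    # Launch as usual; the text report is printed to stdout when the run finishes
    export OMP_NUM_THREADS=4
    srun -n 8 -c 4 ./jacobi_mpiomp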
The text-based results generated by perftools-lite look like this. The first chart it provides gives you the number of ranks and the amount of resources you used, along with information about the architecture on which you are running. Another important thing to notice is that we get the I/O information straight from a visible chart, without any additional modification to the code.
The units of the sample time are hundredths of a second, and the chart shows you the most time-consuming kernels or loops. In this particular example, the jacobi_mpiomp loop on line 61 takes around 53.3 percent of the sample time, and the next most computationally expensive part is the compute_diff loop, which starts on line 261. The information has to be read in this manner.
You will also notice that it gives you MPI information, which is pretty handy: if your code needs optimization in terms of MPI, you can use perftools very easily, with little to no modification to the code, and get the information you need to optimize it.
The next table shows you line numbers along with the function, so it is pretty much the same as Table 1, except that it also tells you which file the function or loop belongs to. Here, within the jacobi_mpiomp loop on line 61, the loop is broken down so you can see the two most time-consuming parts of this particular loop: lines 63 and 66 are the computationally dominant lines in this function. Similarly, we see the breakdown for the compute_diff and MPI functions. This is another way of seeing how much time is required, because it may turn out that some of the computationally most dominant kernels are not the ones requiring the maximum amount of time.
Most of the time they are, but if the code is run differently or there are bottlenecks somewhere else, Table 1 and Table 3 will tend to look different. Table 4 shows power consumption; this is useful for applications where you need to gauge how much power was required to complete the runs. Finally, it generates two more tables, Table 5 and Table 6. We have not shown Table 5, which reports
the average time taken and the number of bytes read from a file; Table 6 similarly reports the average time and the number of bytes written to a file. Since this code does not read anything from a file, it does not generate Table 5; it generates the Table 6 information, and since everything is written out to stdout, it shows the write speed and the average number of bytes written. Beyond perftools-lite, a similar analysis for GPU workloads is supported through perftools-lite-gpu.
It also supports loop analysis specifically, through perftools-lite-loops. The table looks very similar; here I am showing an example of a CUDA-aware MPI code that was profiled using perftools-lite-gpu. The rest of the information looks very similar, although in this case you get information directly from CUDA regarding the kernel launches and memory copies.
Do note, however, that this is just a brief primer, and we are not delving deep into the details of each of the performance tools that we have available.
A more in-depth analysis of your code can be performed using perftools. The steps are very similar to those for perftools-lite: again, the code must be run in scratch, and you follow the same steps that you followed for perftools-lite.
However, there is a bit of a difference. After requesting an interactive node, you take the executable that was generated previously and run pat_build on it to get the detailed information out; it generates a new executable called jacobi_mpi+pat. Before you run it, note that there is also a particular flag to be aware of.
If you are using perftools to analyze a GPU MPI code, you have to add this flag, -g mpi, along with the name of the executable. The syntax is very similar to what we use to run pat_build for a CPU code; you just add the -g flag to get it to work for a GPU code.
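As a rough sketch of that step, with the module swap and trace group following the description above and a placeholder executable name:

    # Swap to the full perftools suite
    module unload perftools-lite
    module load perftools

    # Instrument the previously built executable; -g mpi adds the MPI trace group
    pat_build -g mpi jacobi_mpi
    # This writes a new, instrumented executable named jacobi_mpi+pat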
Once you run the executable with the +pat name, it generates .xf files in the data directory for the run. You then have to convert these .xf files into the app2 format, because we use Apprentice2 to read them. The command is as follows: you run pat_report with the -f flag to format the data for app2, and once that is done, you can launch app2 on the result.
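A hedged sketch of the conversion and viewing steps; the name of the .xf data directory is hypothetical, since CrayPat names it per run:

    # Convert the raw .xf data to the .ap2 format that Apprentice2 reads
    pat_report -f ap2 jacobi_mpi+pat+<run-id>

    # Launch Apprentice2 on the converted result (needs NX or X11 forwarding)
    app2 jacobi_mpi+pat+<run-id>.ap2 &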
app2 opens a window like this, and each tile within the window gives you specific information. The code profiled here is the Jacobi solver that we looked at for perftools-lite, and you can see that it gives you a breakdown of the runtime of each function as a pie chart. It also gives you a flowchart of the code, or the flow of the code, which is really nice if you want to use it for documentation or if you need it for refactoring.
It also shows which directives are running; for example, over here the yellow part is the OpenMP part, and we can see, as a breakdown out of 100 percent, what percentage of time on which rank was used for what purpose. That is really detailed information which can be used to improve your code. One of the tiles also gives you communication information as a mosaic.
This mosaic shows the time taken for communication from each source rank to each destination rank, and you can improve the MPI communication time by analyzing your code using it. Shifting gears, we now look at Nsight Systems, which is a profiling tool provided by NVIDIA for GPU workloads.
Nsight Systems is again a low-overhead profiler, analogous to perftools-lite, while Nsight Compute is more like the full-featured perftools. It provides a broad description of a GPU-based application.
The only module required is cuda or cudatoolkit, and it supports a variety of applications written in CUDA, Kokkos, OpenMP, OpenACC, or Python, but they must run on a GPU and the application must be compiled using the GPU libraries. Here we will be discussing an OpenMP offload based application, which was compiled using the clang++ (LLVM) compiler, and again the code was run in scratch.
To visualize the results, you again have to use NX (NoMachine), or you can use X window forwarding; we do not recommend X window forwarding because it tends to be extremely slow. Instead, what you can do is download the files generated as part of the profile, install Nsight Systems and Nsight Compute, which are available for free, on your local machine, and then analyze your code using those.
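For example, to pull the generated reports back to a workstation where the free Nsight GUIs are installed (the host, path, and report names below are placeholders):

    # Run from your local machine: copy the profiler output out of scratch
    scp <user>@<perlmutter-login>:<scratch-run-dir>/report.qdrep .
    scp <user>@<perlmutter-login>:<scratch-run-dir>/case_one.ncu-rep .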
The run steps are very similar. Again, you compile your code, you request an interactive node, and then, to profile the code, all you have to do is add nsys profile --stats=true in front of the run command. This gives you information similar to what you just saw for perftools-lite: it generates CUDA API statistics so you can understand how the code can be improved and where the bottlenecks are. In particular, it shows you the most dominant kernels in your workflow.
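A minimal sketch of that profiling launch, with a placeholder executable name and illustrative srun options:

    # cudatoolkit provides the nsys command
    module load cudatoolkit

    # Prefix the usual launch with nsys; --stats=true prints the summary tables to stdout
    srun -n 1 --gpus=1 nsys profile --stats=true -o jacobi_offload_report ./jacobi_offload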
Do note, however, that if you are using OpenMP offload or any other GPU-based API, the kernel names may be mangled, so the full name of the function shown in this table might be an omp_offloading prefix with a bunch of letters, followed by the name of the function and the line it is located on.
In this particular example, 61 percent of our time is taken by the compute_yi kernel, which is on line 471, and this kind of information is very beneficial for improving the runtime of the code.
You can also improve your code quite a bit by making sure that your CUDA memcpy transfers from host to device and device to host are kept to a minimum. Here we see the total time taken just by CUDA memcpy from host to device, and the overall objective is to lower that time; you can do that by lowering the number of data transfers between host and device.
Once you have the report, you can use nsys-ui, which is a GUI, to analyze the trace of your code, and this is a very useful feature. Once you load the profiling report that was generated, the .qdrep file, you can zoom into particular parts of the runtime trace by selecting a brief time window. Here we can see that in this particular instance, when the CUDA API is running a host-to-device copy, we get this on our timeline.
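For reference, launching the GUI on the report looks roughly like this, either inside a NoMachine session or locally after copying the report; the file name is a placeholder:

    # Open the Nsight Systems GUI on the generated report
    nsys-ui jacobi_offload_report.qdrep &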
There are some aspects of the code which you can improve just by looking at the trace. Here we see that there are gaps: this is an OMP offload function followed by another function, but there is a gap between them, and our objective, just by looking at the trace, would be to figure out
why there are gaps and what the CPU is doing at that point. Here we also get a CPU trace, so you can compare the GPU trace against the CPU trace and figure out how to improve your runtime just by eliminating these gaps, which are idle times on both host and device. It also provides more information about the resources used: if you select the Events View at the bottom of the window and select a function, it will show you the launch statistics, such as the theoretical occupancy for that particular kernel, which threads were launched, and plenty of other information.
Nsight Compute, as I pointed out, provides a more detailed analysis, and in this demonstration we will do a pretty interesting comparison.
We have a code which we have improved just by adding a collapse clause to a loop, and we want to understand how much improvement we get just by adding this collapse clause to the OMP parallel for pragma. This is a nested loop, and the two for loops are now being collapsed. We will run our profiling step with Nsight Compute twice: the first time we run the baseline code, and the second time, separately, we run the optimized code, and we gather the reports for both of them. To get the reports, we can run ncu -o followed by the name of the report file it will generate.
So you can call the profiles, say, case_one and case_two, and you have to use --set full in order to also get the information about memory transfers. The key thing to note here is that you have to switch the dcgmi profiler to pause, because a profiling step is already running on Perlmutter by default; if you do not pause it, ncu will give you an error. Users who are interested in using Nsight Compute should remember this command: dcgmi profile --pause.
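Putting the two profiling runs together, a rough sketch in which the report names, executable names, and srun options are illustrative:

    # Pause the system-wide DCGM profiling first, otherwise ncu reports an error
    dcgmi profile --pause

    # Baseline run
    srun -n 1 --gpus=1 ncu -o case_one --set full ./jacobi_offload_baseline

    # Run with the collapse clause added
    srun -n 1 --gpus=1 ncu -o case_two --set full ./jacobi_offload_collapse

    # Resume DCGM monitoring when you are done
    dcgmi profile --resume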
This requirement is different from Cori. When you load the two report files, you can set one of them as the baseline; we have set case_one, shown in blue, as the baseline, and we see that, just by adding a single collapse clause to our kernel, our compute throughput as well as our memory throughput have increased tremendously, as has our runtime, which you can see in the comparison.
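The comparison itself happens in the Nsight Compute GUI; roughly, with the hypothetical report names from above:

    # Open both reports, then use "Add Baseline" on case_one to diff the two profiles
    ncu-ui case_one.ncu-rep case_two.ncu-rep &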
A
The
duration
of
this
particular
function
has
improved
improved
by
97.
It's
97,
lower,
compute,
throughput
and
memory.
Throughput
have
improved
as
well,
and
the
reasons
are
also
provided
so
here
the
L1
cache
throughput
has
improved
by
10
and
the
L2
cash
throughput
has
improved
by
142
percent.
Similarly, there is the DRAM improvement. So in a single snapshot, using the Speed of Light analysis, you can understand the improvements in your code and make similar changes in order to improve the overall runtime.
It also provides a roofline analysis, so you can see that, once you make the change, your arithmetic intensity, which is the number of flops executed per byte of data transferred, increased, because we no longer have to move as much data to do the same calculation, and our performance, measured in flops per second, also improved. In a single chart you get this information. It also provides other useful information regarding the compute workload analysis and what sort of memory transfers were taking place.
Here it is somewhat a repeat of the information that we saw in the Speed of Light section, but it gives you a more detailed analysis, for example whether fused multiply-add operations increased or FP64 instructions increased as a result of the change. A more visual picture of how the data moves for this function is also provided, and this is a very beneficial feature: by adding the collapse clause we reduced the data transfer between device memory and the L2 cache by 95 percent, we reduced the back-and-forth transfers between the L2 and L1 caches by 93 percent, and our cache hit rate increased by 22 percent. All of this information is key to improving the overall performance of the code.
A
And,
finally,
you
can
also
figure
out
the
number
of
instructions
and
how
they
changed.
So
this
is
sort
of
a
very
deep
dive.
It
provides
you
additional
information
on
what
aspects
of
the
code
change
by
just
making
one
change
in
your
in
your
code
with
that.
I
conclude
my
talk
I'd
like
to
thank
you
for
attending
this,
and
we
are
glad
to
have
all
the
new
users
and
all
the
users
migrating
to
Paul
matters.
Thank
you.