Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
A: Let's get started with the afternoon session. I'm Zhengji Zhao from the User Engagement Group at NERSC, and I'm going to give this short talk about how to use VTune at NERSC. VTune is an Intel profiling tool, and its focus is basically node-level performance, but it works with both serial and parallel codes, and it provides both a command-line interface and a GUI. Generically, on another cluster, you the user might like to use the GUI to run your performance analysis and then also display the results in the GUI.
A: But in our case we recommend using the command-line interface to collect the data and then displaying it later on a login node. The reason we do that is because, in our case, you may run large-scale jobs that use a lot of node hours, and that is not easily handled by an interactive run. Another, slightly more technical issue: after we switched to Slurm it became easier to run the GUI, but it has a bit of a history.
A: Back when we ran Torque/Moab, the GUI didn't even work on our compute nodes at that time. So anyway, this is our recommended way: you run the command line and collect the data, and then later display it on a login node. For those people connecting to NERSC from remote sites, we recommend using NX to speed up the X11 applications.
A: Actually, if you are very far from NERSC you may see a big delay in your graphical display, but using NX basically solves that problem. VTune is available on Cori as a module, like all our other software; our current default is a 2017 update. The reason I'm emphasizing the version here is that VTune keeps changing its interfaces, how it looks and how it displays things, and also the available analysis types can change.
A: OK, so in this talk my focus will be just providing you a step-by-step guide on how to run VTune on the Cori KNL nodes, and the rest of it I would consider your homework to do. Basically I'm just talking about how to get VTune to run on our systems, because we have NERSC-specific customizations, the custom steps you need to do to run this application.
A: So first you need to compile your code with the debug flag, -g, and another important thing is that you also need to link the code dynamically. In our case the default is static linking, so you need to use the compiler flag -dynamic together with our compiler wrappers to build your application. Once this is done you can run, but there are a couple more notes for this compilation stage.
A: You can see we recommend that the flag -debug inline-debug-info be used. The reason to add this is that during compilation some of the code gets inlined, and if you use this flag you will be able to get more information from the inlined code. Another thing I want to mention: many of the profiling and debugging tools (the debugging tools, rather than the profiling ones) usually ask you to turn off optimizations when you try to use them.
A: But with this one it's okay, you don't have to; you can just keep whatever optimization flags you use in your normal builds of the code. And then, although this is not required, the Intel compilers are recommended. It should work with other compilers, but because this is an Intel product I believe it is mostly tested with the Intel compiler, so this is just an extra recommendation to make your life easier; sometimes we run into things that are actually not well tested.
A: So this is recommended. The other thing I forgot to mention: a static build is not necessarily going to fail, it may work, but in many cases we have seen the code segfault when we built a static binary. So that's just a note. As for the way to compile the code: I just take a skeleton code here, a really small code. It's a hybrid MPI/OpenMP code, and you can compile it like this.
A: As Helen already showed, to compile for KNL you need to swap the default craype-haswell module for the craype-mic-knl module, and then, as mentioned, use -dynamic and -g, and I also used the -debug inline-debug-info option, and use this to build your code. Then, to run VTune, here is the NERSC customization: you need to use the sbatch directive --perf=vtune. This is a very important flag you have to use.
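The compile steps just described might look like the following sketch; the module and wrapper names are as used on Cori at the time, and jacobi.f90/jacobi.x are placeholder names for the skeleton code:

```shell
# Swap the target architecture module and build dynamically with debug
# and inline-debug info; keep your normal optimization flags on.
module swap craype-haswell craype-mic-knl
ftn -dynamic -g -debug inline-debug-info -qopenmp -O2 \
    -o jacobi.x jacobi.f90
```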
A: Under the hood, what the flag does is tell the batch system to prepare the nodes for you, so that those nodes can do whatever VTune tells them to do. Basically, VTune needs some kernel drivers to be loaded in order to be able to collect hardware-event-based profiling data, so that's the requirement. The reason we want to do this dynamically is that VTune touches very low-level stuff.
A: Now it's much more stable, but when we first used this on Cray systems we often saw that it killed nodes and was very fragile. So for some of the kernel modules it uses, we don't want them loaded by default on all the compute nodes; this is the way we manage that.
A: It works like this: when you start a job, those kernel modules get loaded, and when the job quits, those modules are removed. Another thing is that you need to load the vtune module before submitting the job. This is actually a new addition, because now we support multiple kernel drivers on the compute nodes; so before submitting your job you need to load the vtune module, so that the batch system can load the corresponding kernel drivers.
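Putting the pieces together, a minimal job script following these rules might look like this sketch; the node geometry, KNL mode, and time limit are placeholders:

```shell
module load vtune    # load BEFORE submitting, so the batch system
                     # knows which kernel drivers to load
sbatch <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH -C knl,quad,cache
#SBATCH -t 30:00
#SBATCH --perf=vtune    # prepare the nodes: load the VTune drivers
# ... srun amplxe-cl commands go here ...
EOF
```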
A: So this was recently added. Another thing is that you have to use the Lustre file system. Well, I shouldn't say it is necessarily Lustre; a tmpfs is fine too, that's just in memory. The reason is that our global file systems are accessed from the compute nodes through Cray's so-called DVS layer, and DVS does not support all of the memory-map (mmap) functionality.
A: So some things just don't work, and that is why the Lustre file system is a requirement: you have to run VTune out of the Lustre file system. The newer versions of VTune actually report this: if you run on a global file system, like project or home, it will give you a nice informational message and ask you to switch file systems. That is much better; back in the earliest times it would just fail with some misleading messages, but now it's much better.
A: So here are the commands you can type: just go to a directory on the Lustre file system, load the module, and then, taking an example of running VTune interactively with salloc, after you get the compute nodes, inside the job on the compute node, these are the commands you need to type, say module load vtune.
A: This can probably be skipped because we already loaded it outside. Then, let's say, as in this example, we have an OpenMP/MPI hybrid code, so we need to handle the thread affinity: we set the affinity environment variable here and set how many OpenMP threads we want to use. Then the srun command line is the same as you would normally use to run the code, but something goes before your executable.
A: Before your executable on the srun command line you put the VTune command: amplxe-cl, and then -collect, and after -collect comes the analysis type. In this line, memory-access tells VTune to do the memory analysis experiment; -r tells it where to store the result; and -trace-mpi is needed for MPI codes.
A: With -trace-mpi, each rank will have one profiling data file to store its profiling data. Then here are a couple of tips. For data collection, VTune by default does the finalization automatically, which means once it collects the raw data it processes it before the job quits. So here we have an extra option, -finalization-mode=none, which means we don't want to do any post-processing after the raw data is collected.
A: The reason we do this is that the single-thread speed on KNL is much slower than on a conventional core; basically you can say it's two times slower compared to Haswell. So if you do the finalization on the compute node it will take a long time. Instead we defer it: we just collect the data and then do the finalization outside of the batch job.
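An interactive session assembled from the steps above might look like this sketch; the scratch path, executable name, and rank/thread counts are placeholders:

```shell
cd $SCRATCH/vtune_runs          # a Lustre file system, as required
module load vtune
salloc -N 1 -C knl,quad,cache --perf=vtune -t 30:00

# Inside the job, on the compute node:
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
srun -n 8 -c 32 --cpu-bind=cores \
  amplxe-cl -collect memory-access -finalization-mode=none \
            -trace-mpi -r vtune_results -- ./jacobi.x
```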
A: OK, so another option I want to mention is -data-limit=0. This one means we want to collect without a size limit, that is, just collect all the data. Otherwise the default is 500 megabytes, which means VTune stops collecting once the data reaches the 500-megabyte size.
A: For a real application code you will probably run into this limit very soon. Further on I will show you some results from a materials science code, VASP, and we saw that before any iteration even started it had already reached 500 megabytes, so you may need this option for your real runs. And here is another command I would like to mention, so you can see the help information. Let's say you want to know what types of analysis are available; then you can type amplxe-cl -help collect.
A: That will show all the options of this command-line interface and also the analysis types available. For each analysis type there are further options, something called knob options, and those fine-tune what kind of data you collect in your experiment. You can run amplxe-cl -help collect with an analysis type, something like memory-access.
A: Then you can see all the available knob options for the memory-access analysis type. And then, just to show you what kinds of analysis are available: you can see here we have this advanced-hotspots; you can see later what it looks like, but the names pretty much explain what they are. I think the most interesting ones are advanced-hotspots and general-exploration.
A: Also memory-access, and this hpc-performance one is a really good one. Actually, that one is the result of a NERSC request.
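The help commands being described can be typed as follows (amplxe-cl is the 2017-era command name; later VTune releases renamed the tool):

```shell
amplxe-cl -help collect                  # list the available analysis types
amplxe-cl -help collect memory-access    # list the knob options for one type
```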
A: They didn't have it in the earlier versions, but they added it per our request. And then there are knob options like this one for memory-access: the knob analyze-mem-objects. This one is actually critical, but unfortunately I forgot to add it in my experiment. What it does is map the memory operations back to the objects in your code.
A: So if some big array allocation is going on, VTune can map that to the specific object; but I didn't include it in my test, so I'm missing that data. Anyway, those are optional; you can add them per your need. If you want to run batch jobs, then this is the example: to run in cache mode you just request such a node, and this example provides the corresponding command line.
A: The amplxe-cl command line to collect the data is used in the same way. Once the data is collected, you can use the VTune GUI on a login node to display the result. To display it, you just run the amplxe-gui command after you load the vtune module; then from the main display you can see there is a link called Open Result, and you click it.
A: Then you find where your result file is, open it, and it will display the result. If the data is not yet finalized, then upon opening the file the GUI will finalize it first and then load the result; or you can do the finalization outside of the GUI, by just running the command with the -finalize option and giving it the finalization mode, and you can get the data finalized.
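On a login node, the finalize-and-view steps might look like this sketch (the result directory name is a placeholder):

```shell
module load vtune
amplxe-cl -finalize -r vtune_results    # finalize outside the GUI
amplxe-gui &                            # then use "Open Result" to load it
```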
A: OK, so next I will give some examples of what VTune collects and what the GUI interface looks like; I think this one is good for a little demo over here. I have an application called VASP; as I mentioned, it's a bigger code, and it actually uses the most computing cycles at NERSC.
A: When we rank the codes by how they use our computer time, this one is the top one. So now I'm going to show you the test we ran with it. Once you open the GUI, the interface looks like this, and then you click Open Result. I already opened it, so I wanted to skip that; or maybe, let's do it.
A: This is the memory-access analysis. Here it says how this view was collected, and it says you use it to identify the potential memory-access-related issues, and there is more; it tells you what this view provides and what you can get from this display. We can let it go, and then the first one you can see is the Summary view.
A: It summarizes all the things, something like here: Elapsed Time is the wall time used by the code. Actually, this is a very well optimized code, so you can see only one flag; if it were not very well optimized you would see multiple flags. Something like here: this is a red flag, we can see only one, and it says the code has high L2 misses.
A: The nice thing about VTune is that you can just move your cursor onto whatever shows up in the interface. Let's say L2 Miss Bound: this is the term they use, and they explain what it is. The L2 Miss Bound metric shows the ratio of cycles spent handling L2 misses to all cycles, so it defines what it displays, and then it shows how to improve it: it says "consider..." and so on, what it is and then the potential way to improve it. This is a very good part of VTune.
A: The other thing you can see from the Summary report is the bandwidth utilization report. This is a histogram: what it shows is the elapsed time on the vertical axis, and the horizontal axis is in gigabytes per second, that's the unit. This is the DRAM memory bandwidth utilization, so this graph says the code uses about one gigabyte per second of bandwidth for most of the execution time.
A: If you put the cursor here, you can see the bandwidth utilization is one gigabyte per second, and for about 92 seconds the utilization is at that one gigabyte per second. You can see even more: around here you can see three gigabytes per second, but there are only a few seconds where it uses that much bandwidth. And here you have options, because now we have a memory hierarchy with multiple levels; since I ran this in cache mode, you can see it also shows the bandwidth utilization for the MCDRAM.
A: There you can see that for maybe less than a second it reaches close to a large value, though still much lower than the theoretical peak, which is about 470 gigabytes per second. But at least we see that at a certain point it reaches a big bandwidth usage, then it goes lower and comes down; for the majority of the time it is not using much bandwidth. Then there is a Bottom-up view, which provides more detail about this test.
A: Here you can see the bandwidth for the DRAM as a time series: along the execution time, what the bandwidth utilization is. I believe this early part is the initial stage where the test is being set up, and from here, if you hover the cursor, it shows a tooltip like package_0: this is the DRAM total at this point.
A: It says the memory utilization there is less than one gigabyte per second. This is the bandwidth view; it also gives the read/write breakdown and the MCDRAM results, and things like that. The lower part here is CPU Time. If you don't know how they define the CPU Time metric, you can go back to the Summary.
A: There you can see that they define even terms that appear really obvious, but they do define what they are. Anyway, we go to this view, and the pink cells here are the places where VTune thinks the bottleneck is, so you can put the cursor on one and it explains what it is and how you can improve it.
A: So that was the memory-access view, and I can show you one more thing, the hpc-performance characterization view. From the summary you can see the CPU utilization is 17%, and it is flagged over here. The reason it gets a flag is that VASP cannot make use of hyper-threading; if you use hyper-threading it's much slower than it should be. We have in total 272 logical cores, that is, 68 cores times four hyper-threads each.
A: We have that many cores, but we use only 64, so VTune thinks the utilization is poor and gives you this flag. But there is nothing we can do about that; it's the algorithm, it's the nature of the code, it just doesn't work well with hyper-threading. Then we go down to the lower part, and it shows the memory bandwidth utilization and the memory and cache utilization over here. We see the same thing; this is the same system, but I ran multiple times with different analyses.
A: You can see a similar report over here: it shows the L2 miss bound (the previous one showed 24%, this one 23%, but anyway, it shows it), and you can also get the MCDRAM results. It also shows the SIMD instructions per cycle. The interesting measure here is something called packed SIMD instructions: using this measure you can see how well the code is SIMD-vectorized.
A: This code is actually pretty well vectorized; the developers did a lot of work on vectorizing this code, so we can see the result of it, and this is pretty high, I think. You can also do the Bottom-up view, and things like that. I am running out of time, but the last thing I want to mention is that VTune can sometimes appear a little involved.
A: If you don't want to pay that much of a learning curve, another easy one you can try is the so-called application performance snapshot tool. This is a recent product, just out of beta testing; it's freely available software provided by Intel, and it is very nice and easy to use, but provides basic information about how the code performs: information such as CPI, which means cycles per instruction. If this number is high, that means there are many stalls in the code execution.
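As a toy illustration of the CPI metric just described (the counter values here are made up for the example):

```shell
# CPI = cycles retired / instructions retired; a high value means stalls.
awk 'BEGIN { cycles = 4.0e9; instructions = 2.0e9;
             printf "CPI = %.1f\n", cycles / instructions }'
```

A CPI around 1 or below generally indicates a well-fed pipeline; here the made-up counters give a CPI of 2.0.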
A: So a high value is not a good one, and it gets flagged over here. It also analyzes the MPI time, whether OpenMP is imbalanced or not, back-end stalls, SIMD, memory footprint, and even I/O, I think. This is a really good high-level overview of the code performance. To use it, you just load the module, run the script, and then you get the report.
B: Hello, my name is Tuomas Koskela; I'm a postdoc here at NERSC, and I'm going to talk about another Intel performance tool, Intel Advisor, and in particular about the new roofline features that have come up in recent versions of Advisor. I just want to thank Zakhar Matveev from Intel, who is one of the main developers; he's working quite heavily with us to develop and test new features of Advisor, and he has provided me with a lot of the material that I'm going to show here. So I'll go in a bit of a reverse order.
B: I have examples, but at the end; I'm first going to talk more high-level about what you can do with Advisor. Advisor is basically a tool for the vectorization efficiency of your code, although it's kind of spreading into other areas as well. It has five main steps. The first thing, and that is probably what most people will be happy with, is that it provides you with compiler diagnostics and performance data from your application, by loop and by source-code line.
B: It gives you information about how well you're vectorizing, why it thinks you're not vectorizing, and why your vectorization efficiency might be poor. Then the second step is that it gives you some advice on how to fix the issues that it finds, based on what Intel thinks is a good way to vectorize code, which can sometimes be useful. It can tell you things like: you have a dependency here that's preventing your vectorization, and you can remove it easily by these transformations.
B: And that's basically the basic usage of Advisor. Then you can run additional diagnostics to collect trip counts and flops, and that basically gives you an idea of the absolute performance of your code on the system; that's tied to the roofline analysis that I'm going to talk about in just a moment. And then you can do even more detailed analyses, like dependency analysis or memory access pattern analysis.
B: So the basic workflow goes something like this. You start by compiling your code, and what Zhengji said about compiling for VTune is basically true for compiling for Advisor as well. I think you can run it with static linking; I've done it and it hasn't been giving me problems. But you compile with -g and with all your optimization flags on too; obviously you need those for vectorizing. So you compile your code and you run a survey.
B: You get some information from the survey; you might go back, change things in your code, run again, and work in this first loop for a while. Then, if you feel you might need more information, you can go into the deeper analyses: trip counts, dependencies, memory access patterns. But all of these will run on the same binary and on the same Advisor "project", as they like to call it, so you will just be adding more information to the same data set that you're collecting.
B: This is a snapshot of the first step, the summary. What you will see when you open up a result is a summary. There are tabs on the toolbar that contain different things, but the summary will give you an overview of the performance of your code. Some things are similar to VTune, like it will tell you your CPU time, but then you get metrics like how much of your code's execution time is spent in vectorized code.
B: Then it will select some loops up here that it sees are taking up most of your time and give you some additional information, and basically, in this interface, you can click on a loop and it will take you to the source code and give you more details line by line. This bottom part here comes after you've run the memory access pattern analysis.
B: I'll talk about that in a little while. Then what you can do from here is go to the survey report, and it will give you more details on a loop-by-loop basis. Advisor likes breaking things up into loops, and so it marks which of your loops are vectorized with an orange color and scalar ones with a blue color, and then for the vectorized loops it will tell you what the vectorization efficiency is, here on the left.
B: Then it will tell you what vector instructions you're using, how much it thinks you're gaining in performance from vectorization, and what your vector length is. And for the loops that it thinks are not vectorizing, or are vectorizing inefficiently, it will give you some explanation of why this is not vectorizing and maybe what you can do about it. So again, there are links here you can click, and they will take you to the more detailed advice.
B: The other part of this is that you actually have another row of tabs underneath here, so you can go to the source code of each one of these loops, and it will give you, line by line, some of these same metrics. I want to highlight here the Code Analytics tab, which I like a lot. It gives you some analysis of your code for the loop in a sort of compact way.
B: For example, it analyzes your instructions: what fractions of your instructions are spent accessing memory, or computing, or maybe a mix of those, or something else. And it gives you a nice summary of basically all the information that is here on the upper row, but in a sort of clearer fashion.
B: So from here you can mark the loops you want to do deeper analysis on and then run those analyses, for example the memory access pattern analysis. The basic survey has a pretty low overhead, so you shouldn't see more than, let's say, 50% overhead on the execution of your code. The memory access pattern and dependency analyses have huge overheads.
B: So you want to be a bit selective with what you run them on, but you can mark the loops that you think might be interesting for this analysis and analyze them. And here's an example of the memory access pattern analysis: it will go through your memory accesses and give you the ratio of unit-stride, fixed-stride, and irregular-stride accesses; and basically, for vectorization, you want unit stride as much as you can.
B: How much of that can you expect to achieve? On the plot here on the left I'm showing a cartoon of the performance bounds of some system that has two levels of cache hierarchy and DRAM, and is capable of doing scalar, FMA, and SIMD computations. Basically, these lines here give you performance bounds that are set by the system. The y-axis here is performance in gigaflops per second, and the x-axis is arithmetic intensity, which basically means how many flops you compute per byte.
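The bound those roofline lines express can be sketched numerically as the minimum of the compute roof and bandwidth times intensity; the peak and bandwidth figures below are illustrative placeholders, not measurements of any real system:

```shell
# Attainable GF/s = min(compute peak, memory bandwidth * arithmetic intensity)
roofline_bound() {  # args: intensity (flops/byte), peak GF/s, bandwidth GB/s
  awk -v ai="$1" -v peak="$2" -v bw="$3" \
      'BEGIN { p = ai * bw; print (p < peak ? p : peak) }'
}
roofline_bound 0.25 2000 400   # low intensity: bandwidth bound, prints 100
roofline_bound 10   2000 400   # high intensity: compute bound, prints 2000
```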
B: You might notice that there's this quite large region in between where it's a combination of the two, and you need some more detailed analysis to actually figure out what's going on; but that's where Advisor can help you. OK, so this is basically a bit more motivation. You essentially might want to think about using the roofline model if you have this kind of question: you have an application, you've managed to speed it up, and you want to know what is limiting you now.
B: You place it on this plot and you try to analyze what the limiting factor is. And like I said before, usually it's straightforward: in some special cases you might analyze your application, it will sit here on the DRAM memory bandwidth line, and you will say, oh okay, I'm bound by DRAM memory bandwidth, so I need to improve my arithmetic intensity to get more performance, or reuse my cache better to get better performance.
B: So what can Advisor do here? In order to place the application on the roofline plot, you need to know how many flops per second the application is computing and how much data it's moving from memory, and Advisor can do both of these for you and plot your application on the roofline. I have an example here from Advisor, and this is something you also find under the survey tab, in the 2017 and newer versions.
B: OK, so what Advisor is doing to compute this roofline: it needs to measure the time it takes to run your application, it needs to measure how many flops the application is computing, and it needs to measure how many bytes it's moving from memory. This is done in two steps: the survey collection of Advisor counts the time, and then you need to run a second collection, the trip counts collection, which counts the flops and the bytes.
B: The flop counting, if you run this on KNL, is mask-aware. On KNL, if your code is not using all the vector lanes, some of them will get masked out and you risk over-counting the flops. But what Advisor is actually doing, because there is no flop counter on the KNL, is instrumenting the code and counting the instructions, working out the flops from there, and it's also taking the masks into account.
B: OK, so you collect this information and you place your application on the roofline. What do you do with this information? You should see whether your application is compute bound or memory bound, and that should give you some idea of what kind of optimizations you can apply to that application.
B: Compute-bound applications generally benefit from parallelization and vectorization, and memory-bandwidth-bound applications from cache reuse and memory alignment, or from using the high-bandwidth memory on KNL. OK, so I'm going to go through now, step by step, how you run Advisor to get the survey and the roofline. The first thing you need to do is load the advisor module, and that should set the path to the binaries. Then, we recommend running Advisor on the command line and viewing the results in the GUI.
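A sketch of the survey plus trip-counts workflow on the command line; the project directory and executable names are placeholders, and the flags follow the 2017-era advixe-cl interface:

```shell
module load advisor
advixe-cl -collect survey -project-dir ./adv_proj -- ./jacobi.x
advixe-cl -collect tripcounts -flops-and-masks \
          -project-dir ./adv_proj -- ./jacobi.x
advixe-gui ./adv_proj &    # view the survey and roofline on a login node
```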
B: You can also pass the report-survey flag on the command line, and that will write out all the information in the survey into a CSV file that you can then open in whatever text editor or spreadsheet software you like. And that's the basics. If you want to run this on an MPI application using our Slurm scheduler, it's not that different.
B: The only thing you have to remember is that you have to call srun on a compute node, give your normal arguments to srun, then put Advisor as the executable and give your own executable as an argument to Advisor, like this example shows here. Otherwise it's pretty similar. You might also want to add the data-limit flag: if you set data-limit to zero, it basically means that there is no limit on the data size.
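Under Slurm, the MPI collection just described might be launched like this sketch (rank counts and names are placeholders):

```shell
srun -n 8 -c 32 advixe-cl -collect survey -data-limit=0 \
     -project-dir ./adv_proj -- ./jacobi.x
```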
B: OK, so that should be enough to get started with Advisor. I've collected some links here. The first one is our NERSC Advisor page, which is actually really good; there's a lot of good advice in there. Then I've got a couple of papers about the roofline, and then a lot of Intel and NERSC resources on using Advisor.
C: One thing you mentioned is that the profiler is not using performance counters, right? So if you have an application whose behavior is difficult to predict, it could be bound by how it uses the caches, but perhaps it's indexed in a way that's not obvious; then the tool will not be able to measure the intensity, right?
B: Oh, that's a good point. What the current versions of Advisor mean by arithmetic intensity is based on all the data that is moved from any level of the memory hierarchy into the processor; you could say it's an L1 arithmetic intensity, so it will collect everything. We're working with Intel very hard at the moment to get a version that would count the arithmetic intensity just for traffic out of DRAM, because that is more likely to be a bottleneck.
B: Well, I mean, the two have sort of different uses, and they actually complement each other. But this seems to be more difficult for them to implement, so I hope it will come in future releases. They have a beta version that's kind of working at the moment.