Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
So we have already heard quite a lot of in-depth material. With all these options you have heard about, what I'm going to give you is more of an overview: what is really beneficial for many applications. There are loads of configuration options, but most of them usually won't help you a lot, so I will basically tell you what helped for us — and "for us" means from the NERSC Exascale Science Applications Program (NESAP) point of view.
I will explain in the following what that means. So first, again, the differences between, for example, Edison and what we have in Cori KNL.
We now have twice as many nodes, which is quite good throughput-wise, but also note that we now have many more cores per CPU as well as many more hardware threads per CPU. So when you target Cori KNL, you want to exploit that.
If you don't, your code basically won't perform well, because the single-core speed is about half: a core is roughly twice as fast on Edison, for example, compared to KNL, and the gap is even bigger for Haswell.
On the other hand, you have much wider vector units, and you have two of them per core. Before, if you had a code which did not use the vector units much, you lost maybe a factor of four in theory — and since not everything vectorizes perfectly, it usually turned out to be maybe a factor of 2. But here, if you do not vectorize, you can lose a factor of 16 in performance. That's more than an order of magnitude, and you don't want that.
Another difference — this was talked about already — is that Edison's Ivy Bridge processor has 30 megabytes of shared L3 cache, which helped a lot in many applications, more than you may realize, and the KNL doesn't have that. So you can think about what that means.
We heard before that in quad/cache mode, for example, the MCDRAM acts as a cache for DDR accesses, but this cache is only about 450 gigabytes per second, whereas an L3 cache is still about a terabyte per second or so. So that's a real difference, and there are other caveats. On the other hand, you have these sixteen gigabytes of high-bandwidth memory, from which you can pull data about four times as fast as before.
So you want to make use of this, and you think: okay, what can we do — just recompile the code and go? KNL is x86-64 compatible, so you can just take your previously compiled code and technically use it as before — not even recompile, just run it. You can do that. It's also self-hosted, so there is no need for offloading and complicated programming models; you can technically just run your code.
So what happens if you do that? This chart illustrates it: these are selected codes from the NESAP program, and this is the performance when you just take the original codes, compile them on KNL, and compare the performance of Cori KNL versus Edison — these are the gray bars — and then do the same for Cori Haswell.
What you see doesn't look very good. The median — median, not average — speedup is 15% with respect to Edison if you just do this, and the median is even slower than Haswell, because Haswell has a very strong processor.
So what you see here is a median speedup, and some of these applications perform very, very badly on KNL — you get something like 30% of what you got on Edison. This is a cross-section over different applications.
There is a particle-in-cell code and all sorts of other kinds of applications, and I think there is a high probability that if you have an application, it fits somewhere in this picture. What you can also see is that some applications benefit heavily; those are mostly applications which are bandwidth-bound — I will talk about this later.
So the point is that you should optimize your code. Why should you? Easy: you get more for your buck — you can make efficient use of Cori KNL. And I can tell you, if you have a code which is not really optimized at all, fast success is possible, because when you just look into such a code there is a lot of low-hanging fruit you can simply pick, and you might get a good speedup already.
On the other hand, it is not pointless or worthless to optimize for KNL, because these many-core architectures are energy-efficient and they will stay around. Probably in future procurements — not only at our center, but looking at the whole HPC landscape nowadays — many-core architectures are the future. So I can tell you: if you need to optimize your code, do it now.
You will definitely get a benefit for most of it, and not only on KNL. The downside, of course — and I am totally aware of the effort — is that the most beneficial optimizations are really hard-won. You have to restructure code, probably restructure data structures, and some codes are really problematic because they are not easy to make thread-safe in any way without major changes.
So this can happen, but don't be too afraid to at least think about it. And of course the investing-in-the-future point also has a downside: am I betting on the wrong horse — what if Intel in two years leaves this HPC segment altogether? I can assure you there will be other vendors which will step in with similar approaches, so don't hesitate to consider what I'm going to tell you. So, when you optimize, what can you do?
There are certain things to consider: single-node, multi-node, and maybe I/O if you have an I/O-heavy application. You should always start with single-node performance, and for a lot of reasons. First, it's the easiest target: you have fast turnaround times for debugging and profiling, because you only need one node. It's also important because even if you have a multi-node application and you want to run it at a large scale:
you are bounded by the local execution speed. Even if you can hide the communication perfectly behind your computation, you will never be faster than a single node. So you should definitely look at the single-node speed first, and there are many profiling tools available for that; we already discussed a couple, and I will give a more comprehensive list later.
When it comes to multi-node performance, there are fewer optimization opportunities, and the profiling is a bit tedious because you have to deal with all these different processes; debugging is tricky too. When you do these optimizations you might find yourself in a debugging situation, but we offer tools for all of that, so you should not hesitate to try out the things I will mention later. For I/O performance there are not many things you can do at the moment, but I can offer a couple of suggestions. So, for the single-node case: what do I do?
I have an application and I want to know what the problem is — why is it, for example, 2x slower on KNL than on Edison? The first thing, very important: get to know your application. I don't know who is on the line — there might be some very specialized domain scientists — but in general, do not assume you already know your application. I come from a lattice QCD background, and even there, sometimes you ask what the problem actually is, and then they tell you:
"I don't know, maybe it's in your linear algebra routine" — but when you actually look at it, that's not the case. So it's important to determine the hot spots. You can do that, for example, with manual timing routines: take the subroutines or loops or whatever you think takes a lot of time, and time it. With manual timers, be very careful about thread safety, and put in synchronization barriers for MPI; but in general this is a very simple approach and easily portable.
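For illustration, a minimal sketch of such a manual timer in C — compute_kernel is a hypothetical stand-in for the suspected hot spot:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

void compute_kernel(void);   /* placeholder for the hot spot */

void time_kernel(int rank)
{
    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks enter together     */
    double t0 = omp_get_wtime();   /* thread-safe wall-clock timer */
    compute_kernel();
    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks have finished      */
    if (rank == 0)
        printf("kernel time: %.3f s\n", omp_get_wtime() - t0);
}
```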
You can also use profiling tools which do the job for you, and one I like very much is CrayPat. It's basically loading a module; if you then use the Cray compiler wrappers for compiling, it technically just annotates the code, and in the end you get timing information for all the major sections, in addition to some MPI message statistics.
I would suggest starting with CrayPat — just time it and see what really consumes a lot of time. There is also something in between, which is Allinea MAP; it is very nice and comparably lightweight, because it takes more of a sampling approach. VTune is really heavyweight, and its data collection for big kernels can take a very long time. So this is a nice opportunity to try these out. Now assume you timed your application and found your hotspots — what do you do?
Which features shall I target? You have an application with a hotspot which takes, say, 90% of the runtime, and then: shall I use many threads, or try to vectorize that thing, or go for the more complex intrinsics, or do I just move the data to MCDRAM, or use it more efficiently? What shall I do? And the answer is: understand these hotspots.
Okay, so I found them; now understand them. Then for each of these cases you have certain options. For example, if you are compute-bound — I will tell you later what that means — then you use all the threads, you vectorize as much as you can, and you use AVX-512 or the more complex intrinsics.
When you are memory-bandwidth-bound, you have to exploit the memory hierarchy more, and you definitely also want to thread, because more threads can saturate the bandwidth better — a single thread usually cannot saturate the bandwidth fully. And if you are latency-bound, which is the more complicated case, then what usually helps is, for example, again more threads and vectorization. So I put a "yes" everywhere.
I mean: more threads, always. You do not want fewer threads; you will basically always try more processes or more threads — that's my take on it. So this is one solution, and I will show it: try to identify and exploit all the parallelism you can find, because you have a lot of it available and you want to make use of it.
One caveat, in the sense that it might not make sense to utilize all the hyperthreads — that's true. The hardware likes at least one thread per core: if, for example, you use 64 cores, you should try to utilize all of them, so that cores don't just sit idle. Hyperthreads are a different story, because they share execution units and so on, so that's a trickier question; but if you are compute-bound, for example, you can try using them.
As was said already: with the Cray wrappers you can technically just swap the Haswell target module for the KNL one and then just compile, and it will be fine. For Intel you can do the same, or add the optimization flag manually (-xMIC-AVX512) so that it optimizes for the AVX-512 architecture; for GNU you just pass -march=knl and use proper OpenMP settings. This is just an example.
And if you're not feeling very experimental: just try KNL in quad,cache mode and you should be fine for many, many applications, and you don't have to worry about MCDRAM usage for the moment, because it is handled in the background for you.
So what else do you need? You need to understand the number of flops — the number of floating-point operations you execute. Not per second: the total number of floating-point operations.
You go to a kernel and count these things. You can do it manually, which is tedious, but it's nice for understanding the order of magnitude: for each floating-point addition or multiplication you count plus one, and for each complex multiplication you count six, because it has four multiplications and two additions. Then for loops you multiply by the trip count, and so on. That you can do, and then you get an approximate flop count for the kernel.
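As a tiny worked example of that counting recipe (illustrative code, not from the talk):

```c
/* One multiply plus one add per iteration = 2 flops, so this
   loop executes 2*n flops in total. A complex multiply
   (a+bi)(c+di) = (ac-bd) + (ad+bc)i would cost
   4 multiplies + 2 adds = 6 flops, as described above.       */
double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];   /* 1 mul + 1 add per iteration */
    return s;               /* total: 2*n flops            */
}
```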
Alternatively, we have documentation on the website for how to do that [with Intel SDE]; you can technically just copy it. There are start and stop regions which basically say: start the flop collection when the code execution passes this point, and stop it here. Then you get the flops executed in that kernel, and it also accounts for masking — KNL supports vector masking and things like that, and SDE will account for these — so you will get a relatively precise flop count from it. It's pretty much a no-brainer.
no-brainer.
It's
just
that
the
runtime
of
this
thing
can
be
very
large
if
your,
if
your
profile
a
long
section,
so
you
should
technically
try
to
try
to
downscale
the
section
and
then
try
to
estimate
for
bigger
problems
or
for
more
realistic
problems.
What
this
flop
count
will
be.
The next thing you need is the bytes: the number of bytes transferred from main memory for this kernel — from main memory, not from cache. Can you compute this manually? Not really, and we don't recommend it, because you do not account for data reuse through caching. So what can happen?
For example, if you estimate your byte usage by hand and assume that everything you need is read only once and written only once, that is usually a bad approximation, because you can have cache evictions and all these kinds of things, and then you have to read the data multiple times from main memory. So it's good to have an order-of-magnitude estimate here, but I definitely recommend measuring it with VTune.
For example, you can annotate your code — this is a C example; there is also a Fortran equivalent. You basically link the ITT notify library, and then before your kernel you insert an __itt_resume and after it an __itt_pause. SDE works with the same idea.
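A minimal sketch of that C annotation, assuming you link against VTune's libittnotify and start the collection paused (hot_kernel is a placeholder):

```c
#include <ittnotify.h>   /* ships with VTune; link -littnotify */

void hot_kernel(void);   /* placeholder for the kernel of interest */

int main(void)
{
    /* ... setup you do not want to measure ... */
    __itt_resume();      /* start collection here    */
    hot_kernel();
    __itt_pause();       /* stop collection after it */
    return 0;
}
```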
So now you have the time from the first run, when you looked at the timings for all your kernels — you have that number — and then you plot this on the architectural roofline. The architectural roofline is given by the minimum of the memory bandwidth from main memory times your arithmetic intensity, and the peak flops.
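Written out, with AI denoting the arithmetic intensity (total flops divided by bytes moved from main memory), the attainable performance is:

```latex
P_{\mathrm{attainable}} \;=\; \min\bigl(\mathrm{AI} \times BW_{\mathrm{mem}},\; P_{\mathrm{peak}}\bigr)
```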
So what does that look like? This is the picture for KNL: this is basically the roofline for DDR.
This is the roofline for high-bandwidth memory, the MCDRAM; this is, for example, the flops you can reach when you do not vectorize your code; and this is the maximum peak flops you can get. What happens when you compute this arithmetic intensity and the performance for one of your kernels? You will end up, for example, here.
And this, for example, tells you that your kernel has very low arithmetic intensity, and if you hang at this roofline you are memory-bandwidth-bound; probably you ran the code out of DDR, because you hang at the DDR roofline. In that case, what do you want to do? You want to use MCDRAM to make your code faster — note the logarithmic scale, so this alone helps a lot here. What you can also find is a kernel which hangs around here, at the vectorization roofline.
That means it really helps to try to vectorize your code properly, so that you can break through that and get better performance. When you do that, you probably end up at the instruction-level-parallelism roofline: that is when your code does not use fused multiply-adds. Matrix multiplications have a lot of FMAs and can make heavy use of them, and so can some Fourier transforms, but not every kernel can. Only if you exploit these
do you get another factor of two or so, up to the peak roofline. And then there is something in between, and unfortunately many kernels are like this: somehow not really compute-bound, not really memory-bandwidth-bound — maybe latency-bound, something like that. What you can try is to improve your threading and your vectorization, and you might move up a little bit, but don't expect too much; it's a tricky problem.
You really have to do an in-depth investigation to follow what is really going on. Okay, assume you did all that, but then you think: am I happy with this guy here? Probably not, right? Because it sits at the roofline, you cannot do better by simple changes, but your performance might still be rather bad. The only way to improve that thing now is to improve the arithmetic intensity, because if you move in this direction, you still have room left.
For these things you need to look harder and work harder. So what can you do? Remember, the arithmetic intensity is flops over bytes. To improve it — meaning to increase it — you can, for example, really increase the number of flops and leave the number of bytes the same. That sounds easy, but it's actually not, because the flops are determined by the problem you tackle and by the algorithm; it's not easy to change the number of flops.
Of course you can put in meaningless multiplications by one or additions of zero, but that is not the point here — we want to improve the execution speed — so that is usually not a viable way. What you can do is try to keep the number of flops constant but reduce the number of bytes read from main memory. In reality there is a trade-off between these two things, but the second point is the way to go. So here, in more detail, is what you could do.
First: of course, use a lot of threads, ensure vectorization, and, if you use OpenMP, try to reduce overhead. What I have seen a couple of times are codes which are written in a linear-algebra sense. Maybe this is a very naive
version of it, I would say — sorry, no offense — but it's like: you multiply a vector with a matrix and store the vector; then you subtract a vector from another vector; and then you compute the norm of it. This is more like how you think about it in mathematics, but it is not how you should write it.
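A sketch of what I mean, in C (a hypothetical example): instead of computing y = A·x, then r = y − b, then ||r||² as three separate passes over memory, you fuse everything into one pass so the intermediates stay in registers:

```c
#include <stddef.h>

/* Fused version: one pass over A, x, b; the intermediate
   y_i and r_i live in registers instead of arrays.        */
double resid_norm2(const double *A, const double *x,
                   const double *b, size_t n)
{
    double nrm = 0.0;
    for (size_t i = 0; i < n; i++) {
        double yi = 0.0;
        for (size_t j = 0; j < n; j++)
            yi += A[i*n + j] * x[j];   /* y = A*x            */
        double ri = yi - b[i];         /* r = y - b          */
        nrm += ri * ri;                /* ||r||^2, all fused */
    }
    return nrm;   /* caller takes the square root if needed */
}
```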
In physics problems you often have, for example, a loop over the three dimensions. If you use OpenMP, there is a nice statement called collapse, which basically just flattens out the whole loop nest and distributes the threads over the whole fused, collapsed loop. That is quite good, because if you do not put this in, it would only distribute the threads over the outermost loop.
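A minimal sketch of the collapse clause in C (hypothetical update over a 2-D grid with 3 components):

```c
void update(double *u, const double *f, double dt, int ni, int nj)
{
    /* collapse(3) flattens all ni*nj*3 iterations into one
       iteration space; without it, only the ni outermost
       iterations would be distributed among the threads.   */
    #pragma omp parallel for collapse(3)
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++)
            for (int d = 0; d < 3; d++)
                u[(i*nj + j)*3 + d] += dt * f[(i*nj + j)*3 + d];
}
```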
So that is something you should think about. Another thing: you probably need to rearrange data structures, and you can try to move the parallelism outward — going from a fine-grained parallelization model like this one to a more coarse-grained parallelization model. That is not always easy to do and may require, for example, storing intermediate results in arrays and things like that. There is a certain trade-off, but usually you can gain a lot if you do that, to a certain extent, I think.
We also have some case studies on the web pages which describe this. The next thing is loop tiling. Loop tiling improves cache reuse and therefore potentially reduces the number of bytes read from main memory, because data are then no longer read from main memory but from cache — and cache has a much, much higher bandwidth, so those reads don't really hurt you. This, for example, is a very, very simplified kernel from the Quantum ESPRESSO materials science code.
A
It's
technically
we
have
a
matrix
and
multiply
the
rows
of
the
matrix
with
this
P
vector
and
store
it
in
another
matrix.
So
the
problem
here
is
that
I
are
or
like
NIR
it's
long.
So
this
is
very
long
loop
and
it
is
also
very
long
loop
and
what
will
happen
is
when
you
go
through
the
IRS.
It
will.
Basically
you
see
that
that
B
is
not
dependent
on
J.
What happens is that it basically streams through the a's and b's and then starts again for the next j iteration — but by then the b's from earlier in the previous pass have already been evicted from cache, and you need to load them again. So the trick here — and this is always a good thing to do — is to block, or tile, the loops. It makes the code look more complicated, but it can really significantly improve performance.
What we did here was define a block size — this might be architecture-dependent, or you might use a block size which works well on all the architectures you are targeting — and do some index calculation; don't worry about that part. The idea is that you now have an iteration over chunks of the inner loop: one loop iterates over blocks, and in here you iterate within a given block. By doing that you can keep this block of b in memory for all the j's, for example.
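A hypothetical sketch of the blocking idea (not the actual Quantum ESPRESSO source — the names and block size are my assumptions): the i range is tiled so one block of a and p stays in L2 across all j iterations.

```c
#include <stddef.h>

enum { BLOCK = 4096 };   /* tune so the block fits the ~512 KB of L2 */

void scale_rows(double *c, const double *a, const double *p,
                int nj, int nir)
{
    for (int ib = 0; ib < nir; ib += BLOCK) {
        int iend = ib + BLOCK < nir ? ib + BLOCK : nir;
        for (int j = 0; j < nj; j++)
            for (int i = ib; i < iend; i++)   /* reuses p[ib..iend) */
                c[(size_t)j*nir + i] = a[(size_t)j*nir + i] * p[i];
    }
}
```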
This is very important, especially because on KNL we don't have the L3 cache. For this kind of code, like the first loop I showed you, the big L3 cache usually helped: if you couldn't find the data in L2, you went to L3 and probably found it there. But on KNL you don't have it, so every L2 miss goes to DDR or to MCDRAM.
You go to main memory, and you don't want that. When you block these kinds of things, you can try to block for the shared L2. The L2 is shared between the two cores on a tile, and shared means each core's fraction of it is around 500 kilobytes. So if you try to block your loop content to these 500 kilobytes, you usually get a good L2 hit rate. And in order to see whether these kinds of transformations are successful, and to tune them, you can use VTune to check, for example, the L1 and L2 miss rates — just look at how big the miss rates are and then adjust the block size
accordingly. The other thing you can try is short-loop unrolling, and this is basically just helping the compiler to vectorize what you want it to vectorize. For example, you have this nicely collapsed loop with a collapse(3) statement, and in there you have some norm calculation over, say, a three-component vector.
If you leave it like this, what can happen is that the compiler sees: okay, nice, this thing is collapsed — oh wait, there is another loop in here, let's vectorize it. So the auto-vectorizer may try to vectorize that inner thing, but it has a trip count of three, so you waste a lot of vector lanes. You don't want to vectorize there;
A
You
want
a
vector
s
here
right
because
using
these,
these
indices
might
be
big,
and
if
you
don't
put
the
Cindy
statement
here,
the
compiler
will
probably
take
this
loop
too
for
vectorization
as
a
target,
and
that
is
not
great.
So
what
you
need
to
do
is
to
unroll
this
loop,
and
this
is
only
true
kind
of
free
right.
So it's not a big deal: you just unroll it, and then you're done. Especially with loops whose trip counts are not an integer multiple of 2 or 4 or so, the compiler sometimes does partial unrolling and these kinds of things, and usually spends a lot of cycles on stitching these loops back together.
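A hypothetical sketch of the idea in C: vectorize the big collapsed loop and hand-unroll the trip-count-3 inner loop, so the compiler does not pick the short loop as its vectorization target.

```c
#include <stddef.h>

void norms(double *nrm, const double *vec, int ni, int nj)
{
    #pragma omp parallel for simd collapse(2)
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++) {
            const double *v = &vec[3 * ((size_t)i*nj + j)];
            /* unrolled: for (d = 0; d < 3; d++) s += v[d]*v[d]; */
            nrm[(size_t)i*nj + j] = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];
        }
}
```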
Sometimes — look especially at the very bottom of the report — when the compiler goes through the code, it says: yeah, this loop was vectorized, that loop was vectorized; and at the very end it says: oh wait, I did not vectorize these loops because it would probably be inefficient, something like that. So definitely check the compiler output — whether it vectorized the loops you want it to vectorize — or use Intel Advisor, which is basically a very fancy way of parsing these reports.
One thing about vectorization: data alignment. You should align and pad data. In Fortran, gfortran does it automatically for you; if you use the Intel compiler, you have to tell it to do so. So if you compile Fortran code with Intel, try to put in -align array64byte, so that all the arrays are nicely aligned. In C and C++ it's unfortunately not so easy. What you have there is aligned_alloc, or the GNU __attribute__((aligned(64))), and for Intel there is __declspec(align(64)); these go directly on the declaration statements. For heap memory, in C++ you can play a trick: just overload the new operator with an aligned allocation, and you're done. In plain C you don't have that option, so you can write an allocation wrapper function and use it, but then you technically have to go through the whole code.
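A small sketch of both variants in C (the 64-byte figure matches the AVX-512 cache-line-friendly alignment discussed here):

```c
#include <stdlib.h>

/* Static data: GNU attribute syntax; the Intel compiler also
   accepts __declspec(align(64)) on declarations.              */
static double buf[1024] __attribute__((aligned(64)));

/* Heap data: C11 aligned_alloc. The size must be a multiple
   of the alignment, so round it up to the next 64 bytes.      */
double *alloc_aligned(size_t n)
{
    size_t bytes = (n * sizeof(double) + 63) & ~(size_t)63;
    return aligned_alloc(64, bytes);
}
```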
Here is a kernel — I tweaked this version already, but it's one you could technically find in the wild. It looks like a smoothing kernel from a multigrid code, and it's even-odd preconditioned: it iterates over the even sites and then over the odd sites. This is an inefficient version: we have an if condition inside this vectorized loop. The KNL can mask that out — it can detect this.
Basically it can take only the even elements, gather and pack them together to fill the vector unit, execute that, and then basically scatter and inject the results back. But avoiding the conditional is much more efficient: in this case the app runs in 0.8 seconds, so you get a 1.5x speedup just from this tiny, tiny thing. So watch out for these kinds of things when you vectorize, and really try to play around and shuffle these conditionals — first, try to get rid of them completely.
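A hypothetical even/odd sketch with a toy 1-D stencil — not the multigrid kernel from the slide — showing the restructuring:

```c
/* Branching on parity inside the loop leaves half of the
   SIMD lanes masked off and doing no useful work.         */
void smooth_masked(double *out, const double *in, int n, int parity)
{
    for (int i = 1; i < n - 1; i++)
        if ((i & 1) == parity)
            out[i] = 0.5 * (in[i-1] + in[i+1]);
}

/* Restructured as a stride-2 loop: no conditional, and
   every vector lane does useful work.                     */
void smooth_strided(double *out, const double *in, int n, int parity)
{
    for (int i = 2 - parity; i < n - 1; i += 2)
        out[i] = 0.5 * (in[i-1] + in[i+1]);
}
```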
But if you can't, try to do these kinds of transformations to basically reduce them. There is also something which is called reduced-precision math. If you use a lot of transcendental functions — square roots, exponentials, whatever — they are expensive, and the KNL ISA has so-called reduced-precision variants of them. You can enable them by specifying -fp-model fast=2 together with the no-precise-divide/sqrt options during compilation, and this can help you — though don't expect too much from it.
But if you are using these kinds of functions very heavily, it might really help you. Another funny thing we found: if you have something like a division by a constant in a loop — don't do that; just define the inverse of it once and multiply by it. It's funny that the compiler does not necessarily pick that up, especially if you have a bigger code. This sometimes gives you something like 10% for free.
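A tiny illustration of that reciprocal hoisting in C:

```c
/* Division in the loop: one expensive divide per element. */
void scale_div(double *y, const double *x, double c, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] / c;
}

/* Hoisted reciprocal: one divide, then cheap multiplies.
   (Not bit-identical, which is why compilers won't do this
   for you without fast-math style flags.)                  */
void scale_mul(double *y, const double *x, double c, int n)
{
    const double cinv = 1.0 / c;
    for (int i = 0; i < n; i++)
        y[i] = x[i] * cinv;
}
```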
This plot shows the speedup when you compile the code for AVX-512 versus compiling the same code for AVX2, for the optimized codes — so they already vectorize nicely — and the median speedup we get is about 1.2, so roughly 20%. But the benefits can be much larger, for example here, where it can be a multiple.
You can technically even reduce memory latency by using it. Of course, AVX-512 is automatically enabled when compiling for KNL, so you don't have to do anything; here we had to manually disable it in order to test this. Next: use MCDRAM — I talked about this already, and I can recommend to just always use it; don't even think about not using it.
This is the same cross-section of NESAP codes, and these are the speedups you get. The gray bars are the MCDRAM speedup: what you get when you run your code fully from MCDRAM versus running it from DDR. The solid gray might be either flat or cache mode — whatever the best configuration was for that code — and the red additionally compares the benefit of going to flat mode instead of cache mode. What you see is that MCDRAM always helps; there is no case where it really hurts you.
For some codes there is no big speedup, but technically it won't hurt you, so always use it. Flat versus cache is a bit more of a mixed bag, because if you want to use flat mode, you either have to manually place the arrays where you want them — and then you have to think about which arrays to place —
or use numactl, and I cannot recommend using numactl in preferred mode, because of what will usually happen when you spill over: in many codes, at the beginning you initialize a lot of different arrays, maybe to set the whole thing up, but the really hot arrays you work with usually come later. So there is a good probability that those will be allocated in DDR, and you don't want that.
So in my opinion, in this case: just use cache mode, or, if you fit into the 16 gigabytes, use numactl with membind — then it will error out if you run out of memory, so at least you know what happened. This is mostly, I think, why for these codes the flat performance is worse: they used numactl in preferred mode.
One note on heap allocation: on KNL, memory allocation can be comparably slow. It's not super bad if you have a normal code, but some codes have kernels where they allocate and deallocate a lot of arrays, and that is very bad practice. So if you have something like an iterative solver and in every iteration step you do a lot of memory allocations: don't do that — move them out, as far out as possible, so that you technically allocate your whole stuff just once.
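A minimal sketch of that hoisting in C (do_step is a hypothetical solver step):

```c
#include <stdlib.h>

void do_step(double *x, double *tmp, size_t n);  /* placeholder */

void solve(double *x, size_t n, int niter)
{
    /* Bad: malloc/free inside every iteration.
       Better: allocate the scratch space once and reuse it. */
    double *tmp = malloc(n * sizeof *tmp);
    for (int it = 0; it < niter; it++)
        do_step(x, tmp, n);
    free(tmp);
}
```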
If allocating everything once would involve restructuring a huge code framework, you can think about pool-allocator libraries. What they technically do — for example Intel TBB — is overload your malloc or whatever, and basically, instead of really allocating memory from the operating system, they ask the allocator to hand out memory which was pre-allocated. That's much, much faster, and the good thing is you do not need to change the code, except for linking properly.
The bad thing is that you need to know the memory footprint up front: you allocate a pool, and if you run over it, mostly it will just crash with out-of-memory; and the code is less portable. That's the drawback. So the best thing is: just don't do a lot of allocations in the hot path. Now, for multi-node: there are not so many things you can do, but one important thing to consider is that a single KNL thread cannot saturate the Aries injection rate.
So you do not get the full bandwidth from one rank. This is a plot of multi-rank bandwidth — I think it's just a ping-pong benchmark, so you have two nodes sending messages to each other — with one rank per node, two ranks per node, four, and so on, and this is the bandwidth you measure depending on the message size. This cusp is technically a protocol change.
What you can see is that the curves flatten out, but also that one rank per node is pretty bad in terms of bandwidth. So what you want to do is run more than one MPI rank per node if you have this kind of pattern, because it can get you a lot of benefit in this region.
Of course we encourage users to use a mixed model between OpenMP and MPI, and this plot looks like we encourage the opposite, because with 64 ranks per node you only get the maximum benefit for big messages. But in that case you can, for example, think about using threaded communication in MPI — and I think with future MPI versions, where threaded communication is addressed even more, it will get even better — and that would definitely be a good opportunity to really saturate the bandwidth.
So I can recommend more than four ranks per node — let's say four ranks, eight ranks, sixteen ranks per node — and, as Helen said, dedicate cores to the operating system. If you, for example, use four or eight ranks per node, you cannot divide the 68 cores up nicely, so arrange the binding such that you do not split tiles.
That means you can just assume you have 64 cores and dedicate, for example, the remaining cores to the operating system. That's usually a good choice, because then you get some noise mitigation. Huge pages — this was said already.
What huge pages technically do, I think, is reduce translation-lookaside-buffer misses in the Aries when it translates addresses. How to use them: you basically load one of the craype-hugepages modules at compile time — just one, it doesn't matter which; you just need to have one loaded — and at runtime you load the huge-pages module of the size you want. So you can, at compile time, load the 2-megabyte one, and at runtime use any other. That's nice. And then there are also some MPICH environment variables
you can try out. For example, if you have codes which are very collectives-heavy — non-blocking or blocking, it doesn't matter — try to use DMAPP: that is -ldmapp for dynamic linking (for static linking it's a bit more elaborate, but I can hand out the details), and then try to export these variables here.
What they technically do is activate remote memory access over the DMAPP library, and they tell MPI to basically hand the collectives over to the DMAPP library for execution, which then uses RMA. And for certain collective operations, I think, this is somehow even done in the network hardware.
These kinds of settings might, in certain cases, give you up to a 20% speedup — at least that's what I've seen. If you do MPI-3 single-sided remote-memory atomics, for example, you can also try to set this one; it is especially good for small messages. So if you do an accumulate, get-accumulate, or whatever operation on an integer or a single float or double, you can set this, and what it will do
is, I think, use the hardware to do the locking. This can give you up to a 20x speedup, but only for smaller messages — the latency might be 20x better — so if you have bigger messages, it might actually hurt you. Okay, the last thing: some notes on I/O.
This is the write bandwidth in megabytes per second on Haswell and KNL — please ignore the stream and direct-I/O entries; this is for buffered I/O — and what you see is that KNL is 2x slower for a single core. That doesn't sound great for people who do a lot of I/O, but there is a solution, as for everything on KNL: just go multi-core. This is the bandwidth you get with multiple nodes and different numbers of cores per node.
I think one column is write and one is read — it doesn't actually say; I think read is on the right here. What it tells you is that at a single core KNL looks very bad, but if you go to multiple cores, you can even easily outperform Haswell. I don't recommend going to 64-way threading, but technically, if you just use 8 or 16, you're always better off.
Okay, there is one problem — I talked to our I/O people, like Jialin and Glenn: there are no good threaded-I/O solutions available, so either you implement your own (I don't know if there are some issues with POSIX there), or you use multiple processes, and then you can use, for example, MPI-I/O. Also, do not do the following: if you want to write out data, gather everything on rank 0 and have rank 0 write it out — don't do that.
Please: try to use MPI-I/O, or for example HDF5, which uses MPI-I/O under the hood, for parallel I/O.
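As a minimal sketch of that pattern in C (hypothetical file layout: each rank owns one contiguous block of a shared file, written collectively instead of funneling everything through rank 0):

```c
#include <mpi.h>

void write_blocks(const char *fname, const double *buf,
                  int nlocal, int rank)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    /* rank r writes bytes [r*nlocal, (r+1)*nlocal) of doubles */
    MPI_Offset off = (MPI_Offset)rank * nlocal * sizeof(double);
    MPI_File_write_at_all(fh, off, buf, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```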
That's the best take-home message here: try to parallelize your I/O if you really have to write a lot of data. And then — which is always a good thing to do, especially for Lustre — write big chunks; pool your I/O.
Don't write small chunks every once in a while; try to pool them and then write one big chunk, because every time you write small data you might ping the Lustre metadata servers, which really slows you down. And reduce the file operations, the opens and the closes: don't open and close files all the time; try to reduce this. And of course, try to reduce the total number of files, ideally down to one.
It's hard to tell in advance when it will help you, but for large files — say 10 gigabytes and higher — try the Burst Buffer. That is a completely different world and might be a different tutorial, but as was said, you should try it out. We have a link: if you click on this text box, it will bring you to the Burst Buffer page, and you can just try it. It might help you.
[In response to a question:] Yeah, I don't know — I got the data from here, I believe, and I think Brian checked it and said that it looks okay. So, coming back: I talked about all these optimizations, and now you think, okay, does that stuff help? Well, we applied these optimizations to these codes.
This was the before-optimization picture. We applied selections of these optimizations to these codes — and maybe a little bit more; more about this you can find in the case studies, and I will post the link later.
So this is how it looked before — a modest median speedup versus Edison, and slower on Haswell — and now, after that process, we get a median speedup of 1.8x, comparing the optimized code on KNL versus the optimized code on Edison; and versus Haswell it's about even.
So, when you optimize code: go for single-node performance first. Then definitely try loop fusion and tiling — that helps a lot — ensure good vectorization, and use MCDRAM all the time. For multi-node performance, please try out huge pages — very simple: just load a module and compile — and try the DMAPP stuff, which is also just setting a couple of environment variables. And for file-I/O performance: try to parallelize your I/O or pool it, and of course reduce file operations to a minimum.
We have a lot of training material for this. These are all hyperlinks: for running jobs, how to do the thread binding, for code profiling and tools — we have a lot of tools to offer — and how to measure the arithmetic intensity; we have a set of scripts which basically grab the output of VTune and SDE and just give you the numbers you are interested in,
so that you don't need to go through the GUI and look for them; and how to improve OpenMP scaling, vectorization, and MCDRAM usage. Also very important: please look at the case studies. You might have a code which is very similar to a code we already optimized — especially if your kernels are similar kinds of stuff — so definitely look into the case studies, and maybe into the literature; there are a lot of different things around which can really help you optimize. Yeah.