From YouTube: 2 Using Cori KNL Nodes (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
The next topic is about compiling, so we'll talk about the available nodes. Cori has two types of compute nodes: Haswell compute nodes and KNL compute nodes. I assume most users are already familiar with compiling and running on the Haswell nodes, so we just want to compare the two — Haswell versus what Cori KNL offers.
KNL has much slower CPUs but more cores per node, and also a smaller memory per CPU. It has longer vector lengths, so there is more to exploit. It also has a more complicated memory hierarchy, especially when you consider MCDRAM and the different cluster modes, which are part of the node's boot options.
So one of the points I talked about is that binaries built for Haswell can run on KNL as well. If you bring a Haswell binary, you can run it on KNL already, but it won't be optimized, because you didn't exploit the longer vector lengths or the additional threading options. But it does not work the other way around, because of the new instructions introduced in KNL — in particular AVX-512.
On Cori, the situation depends on our configuration. You may hear that some other centers have the ability to compile on the KNL nodes; even where that works, it will be much, much slower, so we don't recommend it. For our particular configuration, we have a different setup of the OS image on the compute nodes, so compiling on KNL is not supported. So we have to cross-compile.
So you have to compile on the login node. This slide didn't convert well from PowerPoint to Google Slides, but if you saw it in the previous talk: basically, the MCDRAM memory-mode setup is relevant to both compile time and run time on Cori, so I wanted to briefly mention it. In the cache mode of the MCDRAM setups, the high bandwidth memory is used as
a cache, and it's always transparent to users. Flat mode is for when you want to manage it explicitly: you can put your code's memory into the MCDRAM, and there are different ways of accessing it — I'll talk about the details later. There's also hybrid mode, where you can split it half-and-half, or 75%/25%, different ways.
The compiler wrappers will know to link to those libraries automatically. The wrappers will also find the MPI libraries and the Cray scientific libraries and link them for you automatically, so users don't have to explicitly add those paths. The default compiler environment — the default user environment — is the Intel compiler environment.
If you want to use a different compiler, what you need to do is module swap PrgEnv-intel with another programming environment, and then use ftn to compile Fortran code.
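As a concrete sketch of the module commands described above (module and wrapper names as used on Cori at the time of this talk):

```shell
# The default environment is Intel; swap to, e.g., the GNU environment:
module swap PrgEnv-intel PrgEnv-gnu

# The compiler wrappers stay the same regardless of the underlying compiler:
ftn mycode.f90 -o myapp   # Fortran
cc  mycode.c   -o myapp   # C
CC  mycode.cpp -o myapp   # C++
```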
On the Haswell login nodes, the craype-haswell module is loaded as well, and the wrapper will link in the corresponding target — for the Intel compiler it effectively adds -xCORE-AVX2. Basically, the target for the machine type is built into the wrapper, so the binary it builds is targeted for that architecture. For Cori KNL, the module — if you want to target KNL — is craype-mic-knl, and that is what the next slide is about.
So now we try to build a KNL-target binary on the Haswell login node. All we need to do — and this is the best recommendation we tell our users — is module swap craype-haswell with craype-mic-knl, and then just go on with the wrapper. It's that simple: the wrapper will take care of the target for you, and it will also link the corresponding KNL libraries.
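The cross-compilation recipe above can be written in two lines (this is a sketch of the Cori workflow; the application name is hypothetical):

```shell
# On a Haswell login node, retarget the wrappers to KNL:
module swap craype-haswell craype-mic-knl

# Then compile as usual; the wrapper adds the KNL target flags and
# links the KNL builds of the Cray-provided libraries:
ftn -o myapp.knl mycode.f90
```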
Another option: sometimes a user wants to build a binary that can run on both targets, on Haswell and on KNL. The Intel compiler has a flag that specifically means "I want to build a binary targeted for multiple architectures." The flag is -ax: you put -axMIC-AVX512, meaning KNL, together with -xCORE-AVX2 targeting Haswell, so that your binary actually runs on both architectures. There is a very small overhead at runtime while it figures out
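A minimal sketch of that Intel-only "fat binary" build (flag spellings as described in the talk):

```shell
# Intel compiler only: one binary with code paths for both architectures.
# -xCORE-AVX2 sets the baseline path (Haswell);
# -axMIC-AVX512 adds an alternate optimized path (KNL):
ftn -xCORE-AVX2 -axMIC-AVX512 -o myapp.fat mycode.f90
```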
what the current target architecture is, takes the right branch, and finally runs. However, we just note that it's not as good, because, for example, of the libraries under the hood: when you compile under the Haswell environment, it will only pick the Haswell branch of those libraries from the linker, not the KNL one.
So the recommendation is still to swap the modules as the previous slide mentioned. Also, the disadvantage of the second option is that it only works for the Intel compiler. Now about linking: we already mentioned that the wrapper links in the libraries, but this topic is about whether you link to a Haswell library or to a KNL library.
Sometimes when we build, we put a pseudo KNL folder there — it looks like a KNL build, but it's actually a symlink back to the Haswell one. So most of those will actually still be linking to the Haswell library, and for the performance-critical libraries you want to link to real KNL builds — either Cray provides those or NERSC provides them. So be patient: not everything is ready, but if you do need something, you can send a ticket to us and we can work on it at a higher priority.
Some of them — for example Cray LibSci, FFTW, PETSc, and the third-party scientific libraries — already have special builds for KNL, and MKL also has KNL-targeted optimizations. But there's a special note: when you want to use that, you need to link with libmemkind. There are two kinds of ways to link it — we'll have another slide on that later. The point is that users do not need to build these libraries for KNL themselves.
This is the convention for NERSC software: we use /usr/common/software, then the software package name, the version number, then either hsw or knl, and then the different compilers. There's an example with PETSc 3.7.4: it has both hsw and knl builds with the Intel compiler, and there are also builds for the other compilers.
There's one issue related to cross-compilation: some of these automated build systems need to run a small test program during the build process — for example, configure needs it to generate a Makefile. That will break, because that binary, already built for KNL, won't run on Haswell. So there are a few workarounds. For autoconf-style builds, what you want to do is run your configuration step still under the Haswell environment, and run the small test programs there.
After that, you swap to craype-mic-knl, and then make will actually compile with the right flags so that you get the correct binary. For CMake, Cray has worked with the CMake vendor, and since version 3.5.0 there are just two simple commands: you export CRAYOS_VERSION=6 and then you define the CMake system name as CrayLinuxEnvironment; then there's no cross-compilation issue. Right, so we talked about basic compiling and basic linking; now I want to touch upon the MCDRAM.
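The two workarounds above can be sketched as follows (paths and project names are placeholders; the CMake variable names are the standard ones for the Cray platform support added in CMake 3.5):

```shell
# Autoconf-style build: configure (and its small test programs) run
# under the Haswell environment, then swap targets before make:
./configure --prefix=$HOME/install     # craype-haswell still loaded
module swap craype-haswell craype-mic-knl
make && make install

# CMake >= 3.5.0 knows how to cross-compile for the Cray environment:
export CRAYOS_VERSION=6
cmake -DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment ..
make
```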
So, the MCDRAM is talked about both here in the programming and compiling part, and again in the running-jobs part. It belongs in the compiling part because it depends on how you want to use it in the different modes and how you want to link — for some of the ways of using it, you need to link to a specific memkind library. That's why it's in the compiling part. So, just to briefly review the MCDRAM: there are two ways to use it, cache mode and flat mode.
We talked about the cache mode: basically there is no change to make in your code, your build procedure, or your run procedure. It's all taken care of by the OS transparently, and it's free. That's why the cache mode is actually our default and works for most users — but we also allow users to explore the other modes. So if you want to explore flat mode, there are a few ways to use the MCDRAM.
One way is to say: I want to put all my data into MCDRAM. The MCDRAM on the node is only 16 GB. If you know your usage fits within 16 GB, that's great — then you just run with your usual srun options and prefix the executable with numactl -m 1, because NUMA node 1 is the MCDRAM in flat mode: this forces all memory allocation onto NUMA node 1. There's another option through srun itself, memory binding: you say --mem-bind=map_mem:1, and that's it.
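A sketch of both variants (task counts are illustrative; the srun option spelling should be checked against `man srun` on your system):

```shell
# Quad flat mode, data fits in the 16 GB MCDRAM (NUMA node 1):
srun -n 64 -c 4 --cpu-bind=cores numactl -m 1 ./myapp

# Equivalent binding through srun's own memory-binding option:
srun -n 64 -c 4 --cpu-bind=cores --mem-bind=map_mem:1 ./myapp
```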
But if your memory usage is over 16 GB and you use this way, it doesn't fit and it will fail. So another way, if you know it's too big or you're not sure, is to use -p, which means preferred, instead of the -m strict option — preferred spills over to DDR after the MCDRAM fills. And the other option is to say explicitly: I know my memory usage is over 16 GB, and I also do not want the system to just put my first 16 GB in there — I want to choose.
I want to put my heavily used, big data into it. So what you want to do is use hbw_malloc to replace malloc — that's the C way of using it — and in Fortran the different compilers have different directives. The first line is the Cray way, the !dir$ memory(bandwidth) directive followed by your arrays, and for Intel you say !DIR$ ATTRIBUTES FASTMEM for your arrays.
The way this works, it only applies to dynamically allocated arrays, so your stack variables and Fortran pointers cannot go there with this method. So that's how you change your program; now we should talk about how you compile when you use libmemkind. There are two ways. Cray has provided a cray-memkind module: you load that module, and then again use the ftn/cc compiler wrappers; the wrapper will add the needed flags to the build command.
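A minimal sketch of the Cray-module route described above (the module name is as given in the talk; the source file is a placeholder that calls hbw_malloc):

```shell
# Load the Cray-provided memkind support, then build with the wrapper;
# the wrapper adds the memkind link flags (and -dynamic) for you:
module load cray-memkind
cc -o myapp mycode.c
```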
It will actually add -dynamic, so the generated binary is dynamic. By default — without -dynamic — static linking is the default way the compiler wrappers work. Most applications are built static, so you may have lots of static libraries lying around, and building as dynamic may not be the preferred way. In that case, NERSC has a module called memkind, and it links with memkind and jemalloc without -dynamic, so that you can still build a static binary.
So I talked about putting your self-chosen variables into HBM — then how do we choose them? There's a tool to lead you there: VTune has a memory-access collection option that will help you diagnose which arrays are good candidates for putting into the MCDRAM. I think there will be more about VTune later, and it was also touched upon this morning.
There's also another one, called the AutoHBW library. For this one you don't have to identify which arrays to put into the MCDRAM. Instead you give it the criteria: if a variable is bigger than some size — say bigger than 4K, or between 4K and 8K in these examples — then I just load the module and set that with an environment variable, with no programming change at all, and it will take care of putting those allocations into MCDRAM.
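A hypothetical usage sketch — the module name and the exact environment-variable spelling may differ on your site, so check the AutoHBW documentation shipped with memkind:

```shell
module load autohbw
export AUTO_HBW_SIZE=4K        # redirect all allocations >= 4K to MCDRAM
# or give a range, e.g. only allocations between 4K and 8K:
export AUTO_HBW_SIZE=4K:8K
srun -n 64 ./myapp             # no source-code change required
```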
So basically I've talked about everything I wanted to cover about how to compile. Just as a summary: you build on the login nodes, and you use the provided libraries — and if you need more libraries, let us know. The takeaway is module swap craype-haswell craype-mic-knl: that's the easiest way to do the build. Or, if you absolutely want one binary that runs on both architectures, then for the Intel compiler you can use the -ax flag as an
alternative workaround. Then keep in mind the points about using MCDRAM — we'll talk about that a little more later. So, any questions so far? Any questions in the chat room? Alright, then I'll go on to running jobs. For running jobs we always want to emphasize that there are so many cores — there are 68 cores on a KNL node, and four hyperthreads each. Where you put your processes and where you put your threads is called affinity. Process affinity is binding your MPI tasks to the CPUs; thread affinity
is binding threads to the CPUs that are allocated to the MPI task those threads belong to — and there is memory affinity too. These affinities are essential. If, say, you put all your MPI tasks on one core, you obviously won't get good performance, so we want those to be evenly, nicely spread out; and then don't over-allocate, and don't bind two different threads to the same place. That's what we try to achieve here. Remember memory affinity as well: where you put your data.
So that's the one thing we want to achieve: good affinity for your processes, threads, and memory. Another thing we want to achieve is portability. We know the Intel compilers have some specific settings, but we try to see if we can find a more portable way that works for all compilers, which is the OpenMP 4 standard settings we try to use.
Okay, just keep this in mind. We also mentioned the many different cluster modes and the different memory modes, and they're all going to affect the affinity. For the cluster modes, there's the quadrant mode we use a lot — we say quad cache, quad flat; that "quad" means the quadrant mode you heard about in the first talk today — and for those there are no extra NUMA domains:
there is one domain with all the CPUs in it. Then we have the sub-NUMA clustering modes, SNC2 and SNC4, and there you start to see more NUMA domains in the memory, different caches, and such; and when you say flat, you introduce one more NUMA dimension. I want to show you one of the utilities, called hwloc — it basically provides hardware locality information. Once you're on the compute node you can run it; this is an example: you get the cores and caches, so here you see the layout.
We try not to allocate across, say, the boundary of a tile, as much as we can avoid it. Another utility is numactl -H — capital H for hardware — and then you can look at how many CPUs there are, which CPUs belong to which NUMA node, and their distances. This example is the 68-core quad cache node: we see 68 cores, and times four that is 272 logical cores — or, in Slurm's terminology,
each logical core is a CPU. So you have 68 cores and 272 CPUs, and they are listed as below: node 0 has CPUs from 0 to 271. Note that CPU 0, CPU 68, CPU 136, and CPU 204 are actually one physical core — four hyperthreads on that core. In the quad cache mode the node shows all the memory, the 96 gigabytes of DDR memory; the other 16 gigabytes of MCDRAM is not shown because it is cache.
It's not explicit memory, so it's not shown, and there's only one NUMA domain, called NUMA node 0; the NUMA distance is 10 from itself to itself. Now we look at a flat node. On the flat node you will see something very similar: NUMA node 0 has all the CPUs and the 96 gigabytes — and then there is another NUMA node:
NUMA node 1, which has no CPUs but the 16 gigabytes of MCDRAM. Basically, you can consider that you want to allocate memory within each NUMA domain, and for the HBM, the ways to allocate memory that we talked about — numactl with the -m or -p options — you just put the correct NUMA node ID (1) there, so that allocations go to the high bandwidth memory.
So this example uses 128 CPUs in total, which is within the 272 CPUs available on the node — 16 times 8 seems a good number. On Haswell you might have been using a setup like this all along and it works fine. So on Cori KNL, let's give it a try: I say export OMP_NUM_THREADS=8, and there are two OpenMP options: I want the proc bind to be spread, and I want the places to be threads.
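Put together, the runtime settings just described look like this (16 ranks times 8 threads = 128 CPUs in use, within the 272 available):

```shell
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread    # spread threads over the allocated CPUs
export OMP_PLACES=threads      # pin each thread to one hardware thread
srun -n 16 -c 16 --cpu-bind=cores ./myapp
```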
What we want is very clear. Say I have MPI: this one color is MPI rank 0 with its 8 threads, and this is where I want them to be. Let me explain this a little more. In this plot, the first line here — all the red numbers — is the physical cores, 0 to 67, and then I give these numbers here
as the hardware CPUs: the logical CPUs 0, 68, 136, and 204 are actually one core's hardware threads, and then the second core has CPUs 1, 69, 137, and 205. So what I want is this for my first MPI rank, because I have 16 MPI ranks in total.
Over a total of 68 cores, that's approximately four cores per MPI rank. So I want MPI rank 0 to be on these first four cores, and then I have 8 threads that I want to spread out.
The first two threads would be on 0 and 136, the second two threads on 1 and 137, and so on. Then this other color is my second MPI rank: my MPI rank 1 I want on cores 4 to 7, and its 8 threads — because I use spread — I want spread out over these different hardware threads. So this is what I want.
So let me show you here: I'm giving it 16, meaning I'm giving it full cores — all the hardware threads of these four cores, 16 CPUs, to my MPI rank 0. That is what the -c option does. Then again, my next MPI rank also gets 16 CPUs, like a pre-allocation. Once you declare it, the runtime looks at this number of CPUs, and then however many threads you use, as long as the number of threads is smaller than that, it won't be over-allocating.
And you can see: if I have only two threads, it will still try to spread them — I would have the two threads on 0 and 136, or 1 and 137. But if I do have 16 threads, that's okay too: each thread pins to one of these CPUs. Although you can also relax this — you don't have to bind each thread to each CPU;
you can allow them to migrate a little bit if you want. That's what OMP_PLACES controls: if you set OMP_PLACES=threads, each thread is pinned to one CPU only; if you say OMP_PLACES=cores, then you're allowing your threads to float within those cores. Those are the choices — you can also bind to sockets, among other options.
So these are basically the essential points. We want our users to use srun's -n and -c options and the --cpu-bind option. For the MPI ranks: if you're not using more than 68 ranks per node, you bind them to cores; once you're using MPI ranks over 68, then you want the --cpu-bind set to threads. The other essential runtime options we want our users to use are OMP_PROC_BIND and OMP_PLACES.
Initially we told users that the Intel runtime basically defaults to something like spread, and the Cray compilers as well, but the GNU compiler has a different default, so you won't get a consistent layout across compilers. So we recommend users set OMP_PROC_BIND explicitly. We also filed a bug with the GNU compiler; they have fixed it, but it hasn't been released yet. Then there is OMP_PLACES, where you can use the different options.
So these are the two slides: the first slide is about the srun settings for process and thread affinity, and this slide is about the runtime settings for memory affinity. I think I have already talked about these points: if your usage is over 16 GB, note that -m — without "preferred" — is strictly enforced, and if you happen to go over, your application will fail; -p is the preferred option, down here.
Right, in this slide we want to talk about how you request the different modes. We mentioned a few, and there are actually combinations you can choose from. We set aside some nodes: most of our KNL nodes — over six thousand of them — are fixed at the quad cache mode, and we have about three thousand nodes that are allowed to reboot. So when you request, you can say capital -C, knl, and then the different NUMA mode:
quad or the SNC modes, and then the MCDRAM mode: cache, flat, or split. Although, as I mentioned before, SNC4 is not generally recommended, simply because the four NUMA domains you get are not even: there are a total of 68 cores, so you get 9, 9, 8, 8 tiles per NUMA domain, and it's going to be a mess, because your MPI tasks per domain end up uneven and you cannot easily distribute them.
There's another option, core specialization — I'm going to mention it later. We tried to use the extra cores for the core-specialization reservation, but Slurm doesn't support that nicely. So basically we do not recommend SNC4. For the others you can say, okay: SNC2, split, and so on — but most likely, if you do that, your job will end up in the KNL reboot partition, which means, if there are currently no nodes available in that mode,
you will have to wait for a reboot, and it takes about 20 to 40 minutes, depending on how it goes. Sometimes one of the nodes won't come back healthy. So your job will be assigned to the reboot partition and it will pend and wait, and once the nodes are allocated to your job, Slurm may find they are
still not in the mode you desire — then you get into this configuring stage. If you run the queue-monitoring command, squeue, you'll see the state shows as CF, meaning configuring, and in the node list you can see the nodes you have already got, but the job won't start yet. The good thing is that this does not consume your walltime request: say the reboot takes 40 minutes and your walltime request is one hour — you still have the full hour to run after the configuration is completed.
The first example is an MPI code in quad cache mode, and there are a few flags I just want to mention: this example uses one node, submitted to the regular partition for one hour; you request the scratch file system license; and you reserve two cores for core specialization — I'll mention what that is for in a later slide. And then the most important thing: you want a quad cache node.
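A sketch of a batch script combining the flags just listed (directive spellings as used on Cori; the application name is a placeholder):

```shell
#!/bin/bash
#SBATCH -N 1                 # one node
#SBATCH -p regular           # regular partition
#SBATCH -t 01:00:00          # one hour
#SBATCH -L SCRATCH           # scratch file system license
#SBATCH -S 2                 # reserve 2 cores for core specialization
#SBATCH -C knl,quad,cache    # the important part: a quad cache KNL node

export OMP_NUM_THREADS=1     # keep multithreaded libraries quiet (see below)
srun -n 64 -c 4 --cpu-bind=cores ./myapp
```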
So, even though this is a pure MPI code, we still recommend you set OMP_NUM_THREADS=1. That's because when the wrapper compiles, it sometimes links in multithreaded libraries, and you don't want those to spawn threads by default — most multithreaded libraries, like Intel's, would otherwise use however many CPUs are available, up to 272.
Alright, so for this example we have 64 MPI tasks and we allocate 4 CPUs for each, which actually means one core per rank. This is rank 0 on one core, and another core, also with 4 CPUs, is rank 1. Because you bind to cores, the MPI task will bind to this core, but it can actually still move freely within the core's 4 CPUs if the OS thinks it's needed. Each box here is a core — or actually it is a tile, but that's okay: for MPI ranks, you
can have two different MPI ranks on a tile. The thing is, we do not want a mix — say thread 0 from rank 0 and thread 0 from rank 1 crossing the core boundary; that's not good. So the two options here are simply -n 64 -c 4, and in this case the last four cores are not used.
Here it's almost the same as the previous slide, but now I have 16 MPI tasks. For 16 tasks, if I want them to be more evenly distributed, I give more CPUs to each: basically four physical cores, which is 16 logical CPUs. So cores 0, 1, 2, 3 are rank 0, and then all the way down, rank 15 is on cores 60 to 63, and again the last four cores are not used.
In this one, now I want to use some OpenMP threads — four threads. Compared to before, this one just adds the number of threads: 64 MPI tasks again, and I give it -c 4, meaning rank 0 will be here. And since there are no OpenMP binding settings at all, the threads can freely float within these four CPUs.
For the next one, I additionally set OMP_PROC_BIND=true and OMP_PLACES=threads. Now I want the threads to be bound, with each thread pinned to a logical CPU. So again with 64 MPI tasks and four threads each: thread 0 of MPI rank 0 would be on CPU 0, and thread 1 of rank 0 is bound to CPU 68.
This is pretty similar to the previous one, except with more threads: I have eight threads and fewer MPI tasks — 16 MPI tasks with 8 threads. Again I'm using the first 64 cores only, and I'm giving each rank 16 logical CPUs, but I only have 8 threads, so not all the CPUs in those cores will be used, because I'm still binding each thread to a CPU. You will see that threads 0 to 7 are bound.
Okay, now flat — I think this is the quad flat mode. If I'm not allocating anything into MCDRAM, the way of binding everything is exactly the same, and the command will be the same as well if you don't use the MCDRAM — the same -n 64 -c 4 or -c 16 and so on. So basically what we recommend is to use these layouts, especially when your number of MPI tasks is a power of two or similar.
I don't have a separate slide for that: if I want to use MCDRAM, all I need to do is add the option — numactl -m 1, or the membind-preferred variant — just adding it here. What I'm targeting is that, for the quad flat mode, NUMA node 1 is the HBM, the MCDRAM.
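That is, the flat-mode launch is the cache-mode launch plus a numactl prefix (strict versus preferred, as discussed earlier):

```shell
# Strict: all allocations in MCDRAM; fails if usage exceeds 16 GB:
srun -n 64 -c 4 --cpu-bind=cores numactl -m 1 ./myapp

# Preferred: fill MCDRAM first, spill the rest to DDR:
srun -n 64 -c 4 --cpu-bind=cores numactl -p 1 ./myapp
```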
So I just want to quickly go over what --cpu-bind and OMP_PLACES look like. This is an illustration of a core — actually, I'm sorry, a tile. Say the bottom one is core 0 and the top one is core 1: core 0 has CPUs 0, 68, 136, 204, and core 1 has the other four.
When you say --cpu-bind=cores, it means a process can migrate within the CPUs of a core. But if you say --cpu-bind=threads, then the process is bound to one particular hardware thread. And if you say OMP_PLACES=threads, similarly, each OpenMP thread will be bound to a single logical CPU.
But if I say OMP_PLACES=cores, then within my MPI rank you can have, say, four threads within this core, and each of them can float around — or I could have even just one thread on that core, but with OMP_PLACES=cores that thread can still use any CPU of the core during the execution.
So that's just the concept at the core of it: how do you bind — whether you bind to cores or to a thread — and what the impacts are. Right, so I just want to bring up a different topic. We showed all these batch scripts and everything — but what if I want to run something interactive, quick runs, to debug things easily? There are capabilities available. One is debug — I think you're familiar with debug.
The limits for running in debug are a maximum of 512 nodes and 30 minutes. On Cori Haswell we have reserved nodes for debug, so it's easier to get in; on Cori KNL we don't have reserved nodes yet, but so far getting a debug node on KNL is not too hard. We monitor it, and if it is needed we can implement a reservation. There's the limit of 30 minutes and one running job per user, and you can queue up to five jobs.
So that's debug — sometimes you have to wait, but currently the wait is not too long at all. There's another capability called interactive; the way to use it is just --qos=interactive, and it's very similar. However, it has much tighter limits: per repo you can use up to 20 nodes, and each user can only run one job at a time.
The goal for this is that once you submit, either you get the nodes you requested or you get rejected because the nodes are not available — it's an immediate kind of response. So these are good for quick debugging. Although, of course, the limit is only up to 20 nodes; maybe you need a bigger size for your debugging — then you can't use interactive, just use debug. Okay. So in the next set of slides I'm going to talk about some of the recommendations for running jobs — let me just pause a little bit.
Alright, so, recommendations. First we want to talk about using huge pages. We talked about compiling — nothing special there, just ftn or whatever — but using huge pages is very easy: we just module load craype-hugepages. The reason is that the default page size is only 4K, and lots and lots of tests have shown that using huge pages is beneficial, so we want you to try it out.
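A minimal sketch of the huge-pages workflow (the 2M module is the one the talk recommends starting with):

```shell
module load craype-hugepages2M   # 2M pages; other sizes also exist
ftn -o myapp mycode.f90          # recompile with the module loaded
# keep the module loaded in your batch environment at run time too:
srun -n 64 -c 4 --cpu-bind=cores ./myapp
```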
These are applications where we showed the benefit. There are lots of craype-hugepages modules available, from 2M all the way up to a bunch of sizes, and we found that 2M is already helpful. You can try the other sizes, but start with 2M — and there's a notice in the man page
you can check out for more information. Here's our plot versus different numbers of nodes, and the red color is the performance you get from using huge pages. It's not a big improvement, but it's a good, easy achievement and convenient to use, so we recommend using huge pages. Another recommendation: people are reporting performance variations. The reason is that if your job is allocated nodes that are scattered everywhere in the network, it could obviously affect your
MPI communication. So you want to constrain them to a smaller set of cabinets and nodes. There's the concept of a switch, which is basically about 384 nodes — two cabinets on a switch. So with Slurm you can request the number of switches you want, and the maximum number of hours you would be willing to wait. It's sort of extra waiting, but not necessarily: if your job fits right away, that's fine, but otherwise you're saying "I'm willing to wait a little bit longer so that my job can have a closer topology."
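In Slurm this is a single batch directive (the count and wait time here are illustrative):

```shell
# Ask for nodes spanning at most 2 switches, and be willing to wait
# up to 4 extra hours for such an allocation:
#SBATCH --switches=2@4:00:00
```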
So that's a recommendation. The other one is called zone sort. This is a recommendation, but it's also mostly already done by default. Zone sort is actually a solution — a technology applied to an issue. The problem is that, with the quad cache nodes, as time goes on the cache conflicts will increase. The reason is that the MCDRAM cache is direct-mapped, so addresses map into it modulo the cache size:
if two addresses map to the same place, they cannot both be in the cache at the same time, so one has to go out and come back in. All of this affects the performance. What the zone-sort technology does — it's now on by default — is that every time your job is about to start, it sorts the available pages for you, so that you get the good pages first. Then there are options: you can turn it off,
or you can try setting it to run at intervals — however many seconds you want. So if you really want to experiment — the "on" setting is already the default, but if you really want to try, you can use the third option. That's also a recommendation. Another recommendation introduces something called sbcast. Back in the old Torque/Moab days, it was the default:
the scheduler would copy your executable onto the image of each compute node, so that they all start at about the same time — without it, there could be a big delay. So our recommendation is: if you have a job bigger than, say, 1500 MPI tasks, do a manual sbcast. I think we have a request in with Slurm to make it the default — not yet. So what you do is you sbcast
your code to somewhere under /tmp; /tmp is actually in your compute node's memory. And then you run with your normal srun options, and with numactl if it's flat mode, but you do not launch your own original executable: you have to, you know, replace it with the one you already copied over to /tmp. The same goes if you are launching through numactl: the final argument should still be the copied executable, not your original one.
A
Capital S is basically: what I want is to reserve some number of cores per node for just doing OS work, so that the rest of the node won't be disturbed. So that's a good, good concept, and since the OS is mostly using the last few cores anyway, why don't I just dedicate them? I think probably -S 2 is good enough; some people would say four, or one, and it doesn't change things much. But note this can only be used in batch jobs.
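As a sketch, core specialization in a batch script could look like this (the node count, time, and task layout are made-up values):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:10:00
# Reserve 2 cores per node for the OS (core specialization);
# this is only honored for batch jobs.
#SBATCH -S 2

# On a 68-core KNL node this leaves 66 cores for the application.
srun -n 66 -c 4 --cpu_bind=cores ./a.out
```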
A
Okay. So those were, you know, a set of different recommendations, not the general running-jobs recommendations, and not the high-memory-jobs recommendations; these are more specific to KNL. And then I also want to introduce that NERSC has a job script generator, which should be helpful. This is on the My NERSC page: if you go to My NERSC, and then from the left side search for Jobs (you probably have to expand it), you will see the job script generator.
A
Then I also want to touch upon: how do I verify the affinity? We have many ways of verifying affinity. We showed you all those core-binding illustrations, but there are also, you know, command-line and other ways, so I want to talk about each of them. So Cray has given us something called xthi.
A
We've been using this a lot whenever we want to check if we're getting good affinity. It's basically a hybrid MPI/OpenMP code, and it reports back things like: task 0, thread 0 is bound to CPU such-and-such. For our users, we have basically prebuilt those binaries. So what you need to do, whether my code is a pure MPI code or a hybrid code, if I use one of the compilers and I'm on Cori or on Edison: just pick one of the binaries and stick it into whatever srun line you
A
have set up for your application; you only have to replace your binary with one of those to check if I'm getting the thread affinity I want. Then, if it's correct, it's all good: replace it back with your own binary. So you can be sure you're not being punished by, you know, a wrong allocation and things like that. Besides using xthi, there are also two more options. For the Intel compiler, there's the KMP_AFFINITY runtime environment variable, so you don't have to do anything special.
A
You just set that to verbose, and you can use your own application if you want, and it'll, you know, report something to you. And for the Cray compilers, there's something similar called CRAY_OMP_CHECK_AFFINITY; you can set it to TRUE, and it also reports something to you. For both of those, they don't give you MPI ranks at all, because these are OpenMP runtime environments only; they don't know anything about MPI. So you have to, you know, figure out yourself which lines belong to rank 0, rank 1, and so on.
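A minimal sketch of the checks just described (the binary name xthi.intel is a made-up placeholder for one of the prebuilt binaries, and the task counts are illustration values):

```shell
# Substitute one of the prebuilt xthi binaries into your own srun
# line to see where each MPI task / OpenMP thread lands:
export OMP_NUM_THREADS=4
srun -n 8 -c 16 --cpu_bind=cores ./xthi.intel

# Or let the OpenMP runtime report the binding of your own binary
# (no MPI rank information in either case):
export KMP_AFFINITY=verbose            # Intel compiler
export CRAY_OMP_CHECK_AFFINITY=TRUE    # Cray compiler
srun -n 8 -c 16 --cpu_bind=cores ./my_app.exe
```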
A
So then, also for Slurm, there's --cpu_bind=verbose: you can put that in and then check the CPU masks it prints. And for memory binding, there's also --mem_bind=verbose; you can check your memory affinity with that. And there's also numastat -p with your process ID: when your job is running, you can run that and check the NUMA memory usage and NUMA information.
A
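The binding checks just described could look like this on the srun line (the task counts and the PID are made-up placeholders):

```shell
# Ask Slurm to report the CPU masks and memory binding it applies:
srun -n 8 -c 16 --cpu_bind=verbose,cores --mem_bind=verbose ./my_app.exe

# While the job is running, inspect per-NUMA-node memory usage of
# one of its processes (replace 12345 with the real PID):
numastat -p 12345
```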
Right. Then I'm going to talk about a few useful commands for just, you know, finding things out: how many nodes are available, where my jobs are, things like that. So one basic thing is sinfo: you can ask for a format, and what it gives you is a view of the nodes by their different features, and the allocated, idle, other, and total counts, something like that. So in this report, what you get is lines for the quad,cache KNL feature.
A
Let's look at the totals. You may see knl,cache,quad; cache,knl,quad; cache,quad,knl and so on, but they're actually all the same node type: KNL quad,cache, and the order of the features doesn't matter. Adding them up tells you how many there are in total, and you can also check how many are idle: if you look at that column, there are only about two hundred forty idle currently. And the CPU count it reports back is basically the number of cores times the number of hyperthreads.
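A sinfo query along these lines (the format string shown is one plausible choice, not the only one):

```shell
# Node counts as allocated/idle/other/total (%F), grouped by the
# nodes' feature sets (%f):
sinfo -o "%.15F %f"

# Filter by node state, e.g. to see which nodes are allocated:
sinfo -t allocated -o "%D %t %f"
```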
A
You can also ask things like when a node was booted, or its state: if a node is allocated right now, you can ask for state equals allocated, something like that. And also, if you want to know about a job ID, either your job or a job in the queue that you're curious about, you can run scontrol show job with that job ID, and it'll tell you the job ID and the job state. This job was queued when I checked it; the reason was Priority.
A
It's not running yet; that's the Priority reason. It also shows when it was submitted and which partition it was submitted to; this one actually asked for something that needs a reboot. It asked for 32 nodes, and it shows, you know, the command that was used to submit this job and where the job was submitted from. Lots of information, so you can check it.
A
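For example (the job ID here is a made-up placeholder):

```shell
# Show everything Slurm knows about one job: state, reason,
# submit time, partition, node count, submit command/directory, ...
scontrol show job 1234567
```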
Sometimes maybe you want to save a history of your jobs, something like that. A few others: sacct can be used to query many things; basically, sacct is a command to query the Slurm database. Sometimes you want to see, you know, how many jobs ran in the past under my username, something like that. And if you look at the sacct man page, you can give it a format and list all your jobs: you know, start time,
A
end time, elapsed time, number of nodes; all sorts of statistics you can gather with sacct. You can sacct some other users as well; it's not just your own jobs that you're able to query. And then sqs, or squeue in this case, shows the queued jobs and the requested columns: the number of nodes, the QoS that the jobs have requested, or the reason why a job is
A
you know, still in the queue, something like that. And I'd recommend you read some of these man pages. And finally, I want to mention: some jobs use lots of I/O, and the burst buffer is not part of KNL, but it's available on Cori, and that's something you can take advantage of: use the burst buffer for your I/O to speed it up, and please check out the web page to learn how to use the burst buffer. I think I'm through; two more slides, about the current queue structure and upcoming changes.
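The history and queue queries above could look like this (the username and date are placeholders):

```shell
# List my jobs since a given date, with start/end/elapsed/node count:
sacct -u myusername -S 2017-06-01 \
      --format=JobID,JobName,Start,End,Elapsed,NNodes,State

# Show what's in the queue: job ID, name, nodes, QoS, and the
# reason a job is (still) pending:
squeue -u myusername -o "%i %j %D %q %r"
```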
A
So, starting July 1st, we're going to charge, and we have already enabled all users, and we expect more users to come to Cori. So we are adjusting and making changes. Possible changes are: we may remove the partitions, so everything you request would be via QoS, with sbatch --qos=debug, --qos=regular, or --qos=premium, something like that. And then we will, you know, give you a lot: we will allow you to submit more jobs if you want, and we will also change the buckets a little bit; the boundary, the one at 160, could
A
be changed to a somewhat bigger number. But watch for announcements: nothing is in there yet, and nothing is solid or decided; those are just possible changes you may see. And if, for this, you use QoS instead of -p, you may want to modify your scripts for that; it will make for a smoother experience. So, we talked about charging; how much will be charged? It is the base charge factor of 96, times the number of nodes used, times the actual walltime. And here's an example.
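A back-of-the-envelope sketch of that formula (the node count and walltime here are made-up values):

```shell
# charge = machine charge factor (96 for KNL) * nodes * walltime hours
factor=96
nodes=10
hours=2
charge=$((factor * nodes * hours))
echo "$charge NERSC hours"   # 1920 NERSC hours
```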
A
So look for announcements as well. So, the summary for running jobs: use sbatch -C to request the different types of nodes you want; always use -c and --cpu_bind with the srun command; and also use OMP_PROC_BIND or OMP_PLACES to fine-tune your OpenMP threads. For memory access from MCDRAM, use --mem_bind or numactl. And also check out the other, high-level running-jobs pages, and I'd also like to ask you to take advantage of the job script generator.
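Pulled together, those summary recommendations could look like this minimal KNL job sketch (all sizes and names are placeholders):

```shell
#!/bin/bash
# Request quad,cache KNL nodes with -C:
#SBATCH -C knl,quad,cache
#SBATCH --nodes=2
#SBATCH --time=00:30:00

# Fine-tune OpenMP thread placement:
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# Always give srun -c and --cpu_bind:
srun -n 32 -c 16 --cpu_bind=cores ./my_app.exe
```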
A
On OMP_PROC_BIND: the change is committed somewhere upstream, so I asked them the question of when the officially released version would be out; I haven't heard back, but once we have that version we could tell users to use OMP_PROC_BIND=spread for all three compilers. Other than that, on compilers, what we recommend is: the Intel compiler, you know, for Intel processors, is always a good native compiler to use, but GCC is also available, and it's used for lots of application packages.