From YouTube: 1 Intro to KNL on Cori (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
A
Okay, so thank you all for coming. Welcome to the latest Knights Landing new user training. This is the last KNL training that we will do before the KNL nodes go live and before we start charging on Cori, so hopefully by then you all will be familiar with the architecture, the hardware and software, and how to get your codes running well.
A
So for those of you following along at home, the KNL nodes were installed in Cori towards the later end of last year. They've been in sort of a pre-production phase since then, so not quite a year. That phase consisted of no charge for users who were enabled for those nodes and relatively limited access until recently; now all of NERSC's users should be able to use the KNL nodes.
A
If memory serves, I believe it's 96 NERSC hours per KNL node hour, which is a little bit more than the Haswell nodes, and hopefully there will be fewer downtimes, a more stable software environment, and everyone will be running well. So KNL — formally it's called Xeon Phi; KNL is a code name that Intel used until they released it — is the second generation of the Xeon Phi architecture from Intel. Unlike Knights Corner, which some of you may have used, it's a self-hosted architecture, so the operating system runs directly on the Knights Landing hardware.
A
There are significant improvements in both scalar and vector performance over Knights Corner, and there's this new on-package high-bandwidth memory. On this last point, the on-package fabric is actually not relevant to Cori — we don't use Intel's fabric, we use Cray's — so you can ignore that last point. Intel does release three versions of Knights Landing; we are the one on the far left, KNL self-boot. The other two are part of other systems at other labs, but that's not what we have.
A
You can see the MCDRAM is the on-package high-bandwidth memory; there are, I believe, eight channels of MCDRAM and six channels of DDR4. This diagram says 36 tiles, which is 72 cores at two cores per tile, but the part that we have — sorry, we have 68 cores, 34 tiles — has significantly higher peak double-precision and single-precision floating-point performance than both KNC and the Haswell nodes.
A
This slide claims three teraflops; in reality you'll probably see a little bit less than that. And unlike Haswell, which is a two-socket system — at least the version we have on Cori — the KNL nodes are one socket each, and there are 16 gigabytes of this high-bandwidth memory, which can reach somewhere around 400 gigabytes per second of bandwidth, compared to about a hundred for the DDR4.
A
But an important thing to keep in mind is that Intel gives, and rarely takes away, when it comes to the instruction set. So anything that was compiled several generations prior to Knights Landing will still run on Knights Landing without recompiling. So, for example, you would not have to recompile Emacs or vim to run on Knights Landing, which is nice.
A
So this is a comparison of the Knights Landing nodes on Cori versus Edison, which uses the Ivy Bridge Xeon architecture. Edison is the older of the two Crays that we have in production right now. So this just gives you an idea, at a very high level, of how things have changed and of the things that users will need to be aware of when they're migrating from Edison to Cori, and particularly to the Knights Landing nodes on Cori.
A
A KNL node has a single socket with sixty-eight cores. Edison — the Ivy Bridge nodes — also supports two hyper-threads per core; Knights Landing supports four. So the number of hyper-threads that are active per node has gone up from 24 times 2, which is 48, up to 68 times 4, which is 272, so the total number of hyper-threads that you can keep running has gone up a lot. So that's all good; larger numbers are usually better.
A
However, there are important things to keep in mind, such as the clock speed, which has gone down by about a factor of two. Edison sort of fluctuates around two and a half gigahertz, Haswell is about 2.3 gigahertz, and Knights Landing is about half of that, between about 1.2 and 1.4, so the clock speed has gone down quite a bit. That means that in order to get performance we have to find that performance elsewhere, and the way we find it is usually through more parallelism.
A
So Edison only had one type of memory, which I believe is DDR3 — 64 gigabytes per node, which comes out to about two and a half gigabytes per core. Cori has two types of memory, so the memory hierarchy is a little bit more complicated now: it has 16 gigabytes of this on-package MCDRAM with about 400 gigabytes per second of bandwidth, and then 96 gigabytes of DDR4, which is 112 gigabytes total.
A
So if you're running in cache mode — which is a memory mode that you can run in, which we'll talk about in a little bit — that's 112 gigabytes of memory total per node that you have available, which comes out to something like 1.6 gigabytes per core. If you're running in flat mode and you just want to use the high-bandwidth memory exclusively, it's much lower: about 230 megabytes per core. So the memory per core, no matter how you look at it, has gone down quite a bit in comparison to Edison.
A
Again, in terms of the instructions that you can issue on a KNL node, you can see the two columns on the left show the Sandy Bridge and Haswell instructions. Sandy Bridge, I believe — yes, Sandy Bridge supports up to AVX, Haswell supports 256-bit vectors via AVX2, and then KNL supports AVX-512, which is 512-bit, so the vector widths of the instructions are getting wider.
A
Intel has not taken anything away in terms of instructions, so anything that you compiled for Haswell, or even for Sandy Bridge or Ivy Bridge, will still run on KNL. It may not run well — it may not run as well as it can, because it won't have these AVX-512 instructions — but it will run. So that's fine for some applications which don't depend very significantly on performance, for example vim or Emacs.
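As a minimal sketch of what the wider vectors mean in practice (this is not from the talk; the Intel targeting flags in the comments are the standard ones and are assumed here rather than quoted from the slides):

    /* Minimal sketch (not from the talk) of a loop the compiler can
     * auto-vectorize.  Assumed Intel compiler flags:
     *   icc -O2 -xCORE-AVX2  saxpy.c   -> 256-bit AVX2 binary (Haswell); still runs on KNL
     *   icc -O2 -xMIC-AVX512 saxpy.c   -> 512-bit AVX-512 binary for KNL
     */
    #include <stdio.h>
    #define N 1024

    int main(void) {
        static double x[N], y[N];
        const double a = 2.0;

        for (int i = 0; i < N; i++) {   /* initialize the arrays */
            x[i] = (double)i;
            y[i] = 1.0;
        }

        /* daxpy-style loop: with AVX-512 the compiler can process
         * eight doubles per vector instruction instead of four. */
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[%d] = %.1f\n", N - 1, y[N - 1]);
        return 0;
    }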
A
So I mentioned memory modes on the last slide. KNL also has a new feature, which is configurable memory modes; this was not available on Haswell or on Ivy Bridge. What I mean by memory modes is that you can actually decide how you want the high-bandwidth memory — this MCDRAM — and the DDR to interact with each other. The first mode, which is conceptually the simplest in terms of writing code to use it, is called cache mode, and as the name suggests, this treats the MCDRAM as a transparent cache.
A
So it's not a separately addressable piece of memory that you can allocate into; it's totally transparent to you. And while Knights Landing does not have an L3 cache like Haswell does, you can sort of emulate an L3 cache using MCDRAM in this way. It's not as fast — the latency of MCDRAM is not as good as a true L3 cache — but it's definitely better than nothing.
A
There are some potential downsides to using cache mode, which I believe are on the next slide, but that is one mode that's available to you. From a practical perspective, when you're running jobs, in your Slurm script — in the little stanzas at the top where you're choosing the partition and the time limit and so on — there's actually a mode that you can specify. In fact, you must specify a constraint when you're requesting Knights Landing nodes: you actually have to tell it what memory mode you want.
A
In fact, if you don't, it will just give you something, and you'll find out at runtime what you got. So another mode that you can use is called flat mode, and in this case the MCDRAM is actually configured as a separate NUMA domain. So now, when you ask the hardware what memory it has available, it will tell you: I have two banks of memory, 96 gigabytes of DDR and 16 gigabytes of high-bandwidth memory. Flat mode, in terms of performance, can do better than cache mode.
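To make that concrete, here is a minimal sketch — not from the talk — of asking the hardware what NUMA domains it exposes, using libnuma (numa_available, numa_max_node and numa_node_size64 are standard libnuma calls; the build line is an assumption). In flat mode on a KNL node you would expect a large DDR domain plus a 16 GB MCDRAM domain.

    /* Minimal sketch (not from the talk): list the NUMA domains a node
     * exposes and how much memory each holds, using libnuma.
     * Build with something like:  cc query_numa.c -lnuma
     */
    #include <stdio.h>
    #include <numa.h>

    int main(void) {
        if (numa_available() < 0) {
            printf("NUMA is not available on this system\n");
            return 1;
        }
        int max_node = numa_max_node();
        for (int node = 0; node <= max_node; node++) {
            long long free_bytes = 0;
            long long size = numa_node_size64(node, &free_bytes);
            /* In flat mode, one of these domains is the 16 GB MCDRAM. */
            printf("NUMA node %d: %lld MB total, %lld MB free\n",
                   node, size >> 20, free_bytes >> 20);
        }
        return 0;
    }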
A
The downside is that it's a little bit more complicated to use. Either you need an application which has a fairly small memory footprint — because 16 gigabytes is not very much — or you have to explicitly address that memory, and there are tools for doing that, but you'll have to write into your application specific malloc and free statements that say "use MCDRAM" or "use DDR".
A
So it's a little more complicated, but you can get faster performance. And then a third option is hybrid mode, where you can actually choose a little bit of both: you can decide to have a little bit of the memory in flat mode and then a little bit in cache mode. This is not a particularly popular mode — in fact, I personally have never used it — and it's quite a bit to manage, so generally users will prefer either pure cache mode or pure flat mode. But you can choose either.
A
You can choose any of these modes that you want when you run your job. The downside is that they have to be configured at boot time, so if a node in the mode that you request is not available, Slurm will actually have to reboot the node, and — I forget what the actual time it takes these days is — you can expect 20 minutes to half an hour. So it's not short.
A
Let's see if I get this right — yeah, so on the left is KNL. This is in flat mode, so you'll actually see a separate NUMA domain with 16 gigabytes of MCDRAM and another NUMA domain that has the DDR. In contrast, on Ivy Bridge or on Haswell you will always see two NUMA domains; you can't choose to have just one, because there are two physical sockets. So, as for the general use case for running in flat mode:
A
If you need to allocate memory explicitly into the high-bandwidth memory, generally you want to choose the memory that will be used most often in your application. Because there's not much storage available — only 16 gigabytes — you want to devote that limited storage to the memory which needs to have very high bandwidth, so you can get good performance. So there are two ways to explicitly allocate memory into the high-bandwidth memory.
A
One is through this memkind library, which we already have on Cori, and there's a little description of the API on the next slide, I believe. If you're writing C or C++ applications, this memkind library is generally what you will use. The other option, if you're writing Fortran codes and you need to allocate memory into MCDRAM, is to use the FASTMEM directive, which is an Intel compiler directive. It's not part of the Fortran standard — it is an Intel-specific directive — but that option is available to you as well.
A
So you have to include — I believe there's a header file that you need to include, which is probably memkind.h — and then, in terms of what code you need to change: if you're using mallocs and frees, like in C code, all you need to do is change your malloc to hbw_malloc, and hbw_malloc will explicitly allocate memory into high-bandwidth memory. I don't believe there are corresponding new and free statements.
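As a minimal sketch of that substitution — not from the slides, and assuming the hbwmalloc interface shipped with the memkind library (hbwmalloc.h, hbw_check_available, hbw_malloc, hbw_free) and a build line roughly like "cc app.c -lmemkind" after loading a memkind module:

    /* Minimal sketch (not from the slides): replace malloc/free with the
     * hbwmalloc interface from the memkind library so the array lands in
     * MCDRAM when high-bandwidth memory is available.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>

    int main(void) {
        size_t n = 1 << 20;
        int have_hbw = (hbw_check_available() == 0);

        /* Fall back to an ordinary DDR allocation when no HBM is found. */
        double *a = have_hbw ? hbw_malloc(n * sizeof(double))
                             : malloc(n * sizeof(double));
        if (a == NULL) return 1;

        for (size_t i = 0; i < n; i++)
            a[i] = (double)i;
        printf("a[n-1] = %.1f\n", a[n - 1]);

        if (have_hbw) hbw_free(a);
        else          free(a);
        return 0;
    }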
A
Well, thank you for the clarification. So that is memkind on the left, and then on the right, if you're writing Fortran applications and you need to explicitly allocate high-bandwidth memory, there's this FASTMEM directive, which is shown in red. This directive is Intel-specific, so again, if you compile with the Cray compiler or with GNU, they won't know what to do with this flag, and I think they'll actually ignore it entirely.
A
But you just append this FASTMEM stanza to the array, or the memory, that you are going to allocate. I believe FASTMEM only works with allocatable arrays — I believe that's true — so if you just have, for example, real a(8), a stack array, I don't believe FASTMEM will work; it requires the allocatable attribute.
A
Yes, the comment from the audience is just pointing out that there are memkind modules available at NERSC, so "module avail memkind" will show you the options.
C
Okay, thanks Brian for the introduction to the architecture. I'm just going to describe in a little bit more detail some of the other architectural options in KNL.
C
So the most common mode, and also the default in almost every application, is quadrant mode. In quadrant mode the Xeon Phi chip is exposed as a single NUMA domain, but internally it's divided into four virtual quadrants, and what these virtual quadrants mean is that memory addresses are hashed to a directory — the distributed directory in the cache hierarchy — in the same quadrant as the core that the request comes from.
C
So what this means is that you get some benefit of locality when you try to access some memory that's local to your quadrant, and in general this is probably the easiest mode to use in terms of the performance versus ease-of-use trade-off. You only have to worry about one NUMA domain, you don't have to explicitly manage your locality, and you get a bandwidth benefit versus having an all-to-all mode where there's no affinity between the address and the directory.
C
And I guess I could describe it with this block diagram: it shows what happens when a core on the top-left tile, labeled 1, goes to access some memory address that it doesn't own. The request travels across the mesh to a directory that tells it which MCDRAM controller to request the data from, and then that gets sent back across the on-chip mesh to the tile.
C
So another way to configure the chip is through the sub-NUMA clustering modes. There are actually two different SNC modes: you can either divide the chip in half or divide it into quadrants, and in this case that division is exposed to the operating system and the user. So if you choose SNC2 mode, the chip appears as two separate NUMA domains, which will be familiar to anyone who has used the dual-socket Xeon nodes on Edison or the Haswell partition on Cori.
C
So in general the standard advice is to stick with quadrant. It's certainly worth checking the other modes, but quadrant is by far the easiest, and in pretty much every case I've seen it gives very close to optimal performance compared to the others. So yeah, as Brian mentioned earlier, there are a number of ways you can configure the memory — the on-package memory — on the Xeon Phi.
C
However, in this case, if you are using lots of memory and you have a cache miss, then you're paying a latency penalty, because you've gone to the MCDRAM and now you need to go to the DDR. In flat mode the MCDRAM is exposed as a separate NUMA domain that just contains the memory, and you can place your allocations there through numactl; numactl --hardware or lscpu can give you all the details of the exact hardware configuration that you have.
C
So the main choice is usually between cache and flat, and both have their upsides and downsides. In cache mode, the benefit is that it kind of works out of the box: you don't have to change your code or make any changes to your run scripts, and you do get a significant bandwidth benefit over running out of just the DDR.
C
So,
if
you're
really
trying
to
use
as
much
memory
as
possible
on
a
node,
this
mode
will
take
some
of
that
away
from
you
in
flat
mode.
Is
the
the
most
performant
choice
if
you're
running,
if
your
application
is
using
less
than
16
gigabytes?
So
if
you
know
for
sure
that
you're
only
using
you
know
10
gigabytes
per
node,
then
you
can
run
entirely
out
of
MC
DRAM
and
get
the
maximum
performance
possible.
C
You can use numactl to bind the memory allocations to a specific NUMA domain. The other option is using libraries like the memkind one that Brian talked about earlier: memkind can enable some compiler directives for Fortran, or you have to rewrite your mallocs and frees with the corresponding hbw versions in C code.
C
numactl --hardware shows you what you have, and then numactl --membind= tells it which NUMA nodes to bind your memory to. Then there are also these --preferred and --interleave options; --preferred gives you some control over spilling, so if your memory allocations might go over 16 gigabytes, instead of just running out of memory they can spill over to the DDR.
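A code-level analogue of that "preferred" spill-over behaviour — not mentioned in the talk, and assuming memkind's MEMKIND_HBW_PREFERRED kind, which asks for MCDRAM but falls back to DDR rather than failing — would look roughly like this:

    /* Minimal sketch (not from the talk): ask memkind for MCDRAM but let the
     * allocation fall back to DDR if high-bandwidth memory runs out,
     * analogous in spirit to numactl --preferred.
     */
    #include <stdio.h>
    #include <memkind.h>

    int main(void) {
        size_t n = 1 << 20;

        /* MEMKIND_HBW_PREFERRED: prefer MCDRAM, spill to DDR instead of failing. */
        double *a = memkind_malloc(MEMKIND_HBW_PREFERRED, n * sizeof(double));
        if (a == NULL) return 1;

        for (size_t i = 0; i < n; i++)
            a[i] = (double)i;
        printf("a[n-1] = %.1f\n", a[n - 1]);

        memkind_free(MEMKIND_HBW_PREFERRED, a);
        return 0;
    }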
C
So your launch command can be quite complex, but for single MPI ranks you can use numactl and then specify a comma-separated list of NUMA domains that memory allocations should be in, and again numactl with a capital -H tells you about the actual hardware layout that you're looking at.
C
Yeah, so the question was: is there huge-page support for MCDRAM? And yes, it works the same way. I think at one point there was a bug with a specific size of huge page in MCDRAM, but I think that's solved at this point. But yeah, you can compile with the appropriate craype-hugepages module and then use your application as normal.