From YouTube: 9 Case Study 4: MFDn (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
MFDn is a many-body configuration interaction solver. The code itself is Fortran 90 with a very small amount of C kernels that may or may not be used, but otherwise it's totally self-contained with its own custom eigensolvers; the only external dependencies are BLAS and LAPACK. It's an application that's in use at a number of DOE centers.
So this part of the code is really embarrassingly parallel: every process just needs to figure out which of the nonzero matrix elements it owns and then actually calculate them. This part of the code is mostly integer operations, branching, and lookups, reading out of big lookup tables to compute the matrix elements.
So a typical loop structure in this code is to loop over all of your column states and all of your row states and then decide whether that pair is a nonzero element with a fast method. For example, you could XOR the two bit representations, say the first 16 bits, and count how many set bits are left as a quick check. If you pass the fast check, then you do a more detailed comparison, applying the full quantum mechanical selection rules, which can be quite complicated, and depending on what phase of initialization you're in, you might then optionally compute that matrix element.
Unfortunately, the popcount instruction on KNL only works on integer registers, which has been quite a problem for us. So in some cases we've done better by manually coding our own popcount with intrinsics, but it's really tough to do better than a hardware instruction, even with all the extra copies that you have to do. But manual tiling, and then using OpenMP 4.0 directives to force the compiler to actually generate vector instructions where it should, gives us something like a 20% speed-up.
And just a little bit about the actual format of the matrix: it's in a custom format called compressed sparse blocks coordinate format, and that's due to a couple of reasons. In this code you have to do a matrix-vector operation, but you also have to do the transpose. So if you use compressed sparse row or compressed sparse column, you'll be very fast one way and slow the other way. But if you use compressed sparse blocks that are chosen nicely for your cache size, you can iterate through either dimension.
So one of the main things we did to speed things up on KNL was to get rid of the SpMV and switch to an SpMM, where instead of a single vector you're applying the matrix to multiple right-hand sides at once. This table just shows that increasing the number of right-hand sides increases your arithmetic intensity by almost a factor of four, so you get much better data reuse and we're getting much higher flops.
So we used the FASTMEM directives in Fortran to explicitly place just those two arrays: we stream the matrix A out of DDR and then operate on X and Y kept in MCDRAM. And actually, in this case it's sort of a funny coincidence, but by choosing the number of right-hand sides we can roughly match the ratio of data movement between DDR and MCDRAM to the ratio of bandwidth that DDR and MCDRAM can provide.
So we can get a pretty high fraction of sustained bandwidth from both DDR and MCDRAM at the same time. And this is just a little model showing how well we can do: the blue line is the maximum performance we would expect if we could fully utilize both memories at once, depending on the number of right-hand sides we're using, and you can see that with eight right-hand sides we're getting pretty close to the best we could hope to do, which is good, because anything past eight...
...block preconditioned method. And then just some bigger, broader takeaways: in this specific case we did get a speed-up over using just cache mode for a very memory-intensive code, but that's not necessarily true in every case, and it also requires a lot more effort on our part to manage where the data is actually allocated. So in general, cache mode is within ninety percent of the flat-mode code. And then another thing we found as we've scaled this up is that you really need at least four MPI processes.