From YouTube: 8 Case Study 3: AMReX (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
Hi, my name is Brian; I was here earlier this morning. This is less of a case study and more of a work in progress. It's a relatively recent undertaking, and in that way it's a little more exciting, but I also don't have nice performance plots like ZZ had. I'm sort of thinking through how to solve this interesting algorithmic problem, and that's what I've documented here a little bit.
AMReX is derived from BoxLib, which was the previous framework for doing things like this, and it also has many elements of the Chombo framework, which is also developed here. AMReX is a framework, so science applications sit on top of it, and those applications do many things. Most of them are fluid dynamics of some sort: reacting flows, so combustion (chemical combustion for terrestrial applications and nuclear combustion for astrophysics), low Mach number flows, compressible and incompressible things. There's even an N-body capability; there is a particle component of AMReX, so if you're doing particle-grid or particle-fluid interactions, you can do that. They're also working on some really interesting things with embedded boundaries and complex geometries. AMReX has many players in it, and it is part of the DOE Exascale Computing Project.
One of the applications I've been working on for the last several months, a few years in fact, is called Nyx, which is a cosmology code. Nyx is actually a hybrid of two sorts of algorithms: there's a compressible-flows algorithm and there's also an N-body component. So Nyx does cosmological simulations that incorporate both baryonic matter, which is to say a fluid that collides with itself, and dark matter. It's used for simulating early-universe things: the formation of filaments in the universe, dark matter halos, AGN feedback. There are a lot of applications of this code.
It doesn't show up that well on the screen, I think, but this is a simulation of filaments forming that I actually did on Cori not that long ago; it's part of a movie and it looks really nice.
There are a number of different algorithms that are all operating at the same time in Nyx; you can roughly narrow them down to four. There are, as I mentioned, the compressible-flows component and the N-body component, and those two, relatively speaking, are fairly cheap to compute. The compressible-flows algorithm requires some information from the nearest neighbors of each cell, so it's sort of like a stencil.
The N-body part is much more chaotic: you're basically pointer chasing, because the particles can go anywhere and you have to figure out where they are on the grid. So that's much more random access; it's a very memory-latency-bound algorithm. But the reason it's cheap is that the particle density is actually very low. Whereas in particle-in-cell codes for fusion, for example, the number of particles per grid cell can be extremely high, in cosmology it's sort of the opposite: it can be of order ten, or even one. So that's the reason it's cheap. It's not that the algorithm is cheap; it's that the particle density is quite low.
There's also a multigrid solver in Nyx: we solve a Poisson equation for gravity, so that's our elliptic solver. That one is relatively cheap except at scale, which is fairly typical of multigrid solvers, because there's a lot of communication required as you coarsen the grids and then do restriction and prolongation back up to the fine grids; that communication becomes very expensive at scale. So except when running at the largest scales, for example on half or all of Cori, the multigrid component is relatively cheap.
The really expensive part is actually radiation, which would not come as a surprise to anybody who has actually done radiation transport before. But, interestingly, there is actually no transport done in Nyx: it's all point-wise calculations, basically microphysics, so there are no stencils involved or anything like that. It's still extremely expensive at all scales, though; it's basically independent of scale because it's a point-wise calculation. So I'll show you why that is, and then how I'm trying to fix it, for KNL especially.
In every cell of a Nyx calculation, we have to solve a heating-cooling equation, which takes the ionization state of some number of atoms and computes how much radiation is deposited into the gas and, conversely, how much the gas cools; basically, how the atoms interact with the radiation. You can write it as a first-order differential equation that looks like the one on the slide, which I have made quite simple. But this equation is a little bit strange, because every time you evaluate the right-hand side, this f(y, t), you actually have to solve an equilibrium constraint equation which itself depends on f. So every time you evaluate the right-hand side, you have to do some kind of root find.
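As a minimal sketch of that structure (using only the generic symbols f, y, t, and lambda from the talk; the actual Nyx heating-cooling terms are more involved), the problem looks roughly like:

```latex
\frac{dy}{dt} = f(y, t),
\qquad \text{where each evaluation of } f \text{ first requires a root find on the
equilibrium constraint } g(\lambda;\, y, t) = 0 .
```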
This g is a highly nonlinear function. It's worth saying that even though there's only one equation (this is not a system of differential equations, it's just one), it can be highly sensitive to the dependent variable, so you can see large spikes in the solution as you evolve it in time. You can sort of characterize it as stiff: within a step the solution will spike all over the place, and RK4 doesn't capture that, or may not capture it, unless you take extremely small time steps. For this reason, because we generally have to evaluate the right-hand side many times, and because every right-hand-side evaluation requires doing a root find, solving this ODE is in fact the most expensive component in all of Nyx, and it has to be done in every cell. So, to refresh people's memory about the obstacles to vectorizing this:
They include, for example, the lack of vectorization between iterations. Within a cell, you can't really vectorize much of anything, because there's a data dependence: as you solve for this parameter, as you try to converge this lambda toward 0, the solution for lambda at iteration n+1 depends on the solution at iteration n. That data dependence makes vectorizing along the iteration axis not possible, and so effectively, or rather in reality, all of these ODE integrations, in every cell, are done in scalar.
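As an illustration of that dependence (a hedged sketch, not the actual Nyx source; g and dg are hypothetical stand-ins for the equilibrium function and its derivative), the scalar Newton iteration carries its value from one iteration to the next, which is exactly what blocks vectorization along the iteration axis:

```cpp
#include <cmath>

// Sketch only: scalar Newton iteration for the equilibrium constraint
// g(lambda) = 0 in a single cell.  `g` and `dg` are hypothetical stand-ins.
double solve_equilibrium(double lambda,
                         double (*g)(double), double (*dg)(double),
                         double tol, int max_iter)
{
    for (int n = 0; n < max_iter; ++n) {
        // The update for iteration n+1 needs the value from iteration n,
        // so successive iterations cannot be spread across vector lanes.
        double next = lambda - g(lambda) / dg(lambda);
        if (std::abs(next - lambda) < tol) return next;
        lambda = next;   // loop-carried dependence
    }
    return lambda;
}
```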
(I don't have my watch with me... 23? Okay.) So we can't really vectorize within a cell, and the solution we're implementing right now is actually to vectorize across cells. In effect, we're trying to integrate eight of these ODEs simultaneously, which is an interesting challenge that I've never encountered before; it's a lot of fun. SIMD lanes, unlike threads, operate in lockstep. If you have a threaded region, your threads can diverge in their workload and do different things; that's not a problem. SIMD lanes can't do that: they have to do the same thing in every lane. So if your algorithm has any kind of variability due to a convergence criterion or something like that, that makes doing SIMD a little bit challenging, and variability is certainly a component of what we're trying to do here. There are actually two chief sources of variability.
A
Among,
for
example,
8o
des
from
8
neighboring
cells,
the
first
one
is
that
the
number
of
iterations
to
do
the
number
of
newton
iterations
that
you
have
to
do
to
satisfy
that
equilibrium
constraint,
is
unique
to
the
initial
conditions
they
into
the
right-hand
side
of
every
cell.
So
the
number
of
roots
of
the
number
of
iterations
to
do
your
root
find
might
be
different
in
neighboring
cells.
A
The
other
source
of
variability
is
that
we
use
a
stiff
integrator
to
to
integrate
this
differential
equation
and
it's
actually
an
adaptive
method,
and
so
the
number
of
steps
that
it
will
take
to
to
integrate
from
time
t0
to
t1
may
be
different
for
two
neighboring
cells.
So
again,
there's
it's
challenging
to
handle
that
kind
of
variability
in
Cindy,
but
we
have
some
methods
for
doing
that.
What we do to vectorize the Newton solve is a few extra, sort of pointless, flops in order to do everything in SIMD; that's a fairly common pattern in writing vectorizable code. In particular, we just keep iterating until the slowest-converging cell has converged. So you might have one cell that takes only 2 Newton iterations to get to zero while the cell right next to it takes 9, and the cell that got there first will just keep bisecting, or not bisecting exactly, but it will keep doing that root find and getting even closer to zero. The hope is that it won't actually diverge, which theoretically is possible, but in our tests these are fairly well-behaved root finds, so they don't do that. So that's how we vectorize the right-hand side: it actually does 8 simultaneous Newton solves.
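Here is a minimal sketch of that pattern, assuming a SIMD width of eight doubles and written with a plain fixed-size array that a compiler can auto-vectorize; the real implementation may use intrinsics or a different data layout, and g and dg are again hypothetical stand-ins:

```cpp
#include <cmath>

constexpr int VW = 8;   // assumed SIMD width (eight doubles per vector)

// Sketch only: eight simultaneous Newton solves for g(lambda) = 0, one per cell.
// Every lane keeps iterating until the slowest-converging lane is done; lanes
// that converged early just perform harmless extra updates (wasted flops).
void newton_across_cells(double lambda[VW],
                         double (*g)(double), double (*dg)(double),
                         double tol, int max_iter)
{
    for (int n = 0; n < max_iter; ++n) {
        bool all_converged = true;
        for (int i = 0; i < VW; ++i) {             // no dependence across lanes
            double next = lambda[i] - g(lambda[i]) / dg(lambda[i]);
            if (std::abs(next - lambda[i]) > tol) all_converged = false;
            lambda[i] = next;                      // converged lanes keep refining
        }
        if (all_converged) break;                  // stop when the slowest cell converges
    }
}
```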
As for the other source of variability, we're trying to vectorize the stepping routine itself, and it's in a similar spirit. A well-behaved ODE might take only 3 steps to get from time t0 to t1 in that integration, and another one might take 10. What we're trying right now is to just pick the slowest one: whatever the smallest time step is that any of them needs to take in order to satisfy the convergence criteria.
We just do that, so even for a well-behaved ODE we make it take really small time steps, just like all the others, so that everybody can operate in lockstep. That's not our final answer; it's one thing we're trying, and I think there are better things coming in the pipeline, but it's a first-order approach to solving this problem. Again, we're doing a lot of extra floating-point operations that in theory we don't need to do, but the net benefit is still worth it: you've done a few extra operations you didn't need, but you're also doing eight at a time. So the overall speedup from doing this will certainly not be a perfect factor of eight, but it will be larger than one, as opposed to doing it all in scalar, which is how it's done right now.
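A hedged sketch of that lockstep stepping idea follows: each lane proposes the step its own error control would like, and all lanes then advance by the smallest proposal so they stay in lockstep. The helpers f and propose_dt, and the forward-Euler update standing in for the real stiff integrator, are illustrative assumptions, not Nyx code:

```cpp
#include <algorithm>

constexpr int VW = 8;   // assumed SIMD width

// Sketch only: advance eight independent ODEs y' = f(y, t) from t0 to t1 in lockstep.
// `propose_dt` stands in for whatever adaptive step-size control the integrator uses.
void integrate_lockstep(double y[VW], double t0, double t1,
                        double (*f)(double y, double t),
                        double (*propose_dt)(double y, double t))
{
    double t = t0;
    while (t < t1) {
        // Each lane asks for the step its own convergence criterion would allow...
        double dt = t1 - t;
        for (int i = 0; i < VW; ++i)
            dt = std::min(dt, propose_dt(y[i], t));
        // ...but every lane takes the smallest one, so all lanes stay in lockstep.
        // Well-behaved cells therefore do extra, unneeded steps.
        for (int i = 0; i < VW; ++i)               // vectorizable across lanes
            y[i] += dt * f(y[i], t);               // forward-Euler update as a stand-in
        t += dt;
    }
}
```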
That slide is unreadable, so I will skip it. I think this one is also pretty unreadable; it's just the last bit of the Newton method that's in SIMD now. You can see here's the right-hand side of a Newton iteration, where you take the function divided by its derivative, and then all we do... there's actually a bug in this: this last loop doesn't need to be there.
Okay, so one final point: if you're using an explicit integrator that's still adaptive, like RKF45, it's actually pretty straightforward to vectorize. RKF45 is an adaptive integration scheme; it's probably the simplest adaptive integration method. That's because explicit methods just evaluate the right-hand side, and that's how they decide when they've converged. If you do something implicit like VODE, which uses BDF methods, it's much more complicated, because implicit methods are generally solving some kind of linear system at every integration step: they lead to a Jacobian matrix that you then have to solve with, and solving eight systems of linear equations in SIMD is not so straightforward.
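For context, the step-size control in an embedded explicit pair like RKF45 only needs the local error estimate that comes from extra right-hand-side evaluations, which is part of why it maps onto SIMD lanes so easily. One common form of that update (with a safety factor S, often around 0.9) is roughly:

```latex
h_{\text{new}} \;\approx\; S \, h \left( \frac{\mathrm{tol}}{\mathrm{err}} \right)^{1/5}
```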
[Moderator] Another question? Sure, go ahead.

[Audience member] It's kind of analogous to the parallel-in-time stuff people have been working on. Have you considered parallelizing the iteration itself across a bunch of points? Since your guesses may not be quite right, you could evaluate several different tries for each iteration and pick the best one as the starting point for the next iteration.
[Brian] Yeah, that's a good point. I haven't tried that yet, but I see no reason why it wouldn't work, so maybe that will be step two. Thanks; thank you for the comment.