Description
Kate Clark from NVIDIA presents a talk on Mixed precision optimization for lattice quantum chromodynamics. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Hugo Brunie
Thank you very much for the invitation to talk here today. My apologies ahead of time: I'm a Keynote user and I tried to convert to PowerPoint this morning to get the captions, but I failed, so I'm giving this with Keynote.
So besides being caption-free, I've probably got far too many slides to talk through in this session, but I'll do my best. You'll have the slides at the end and I'm happy to answer any questions offline. Now, to start off, let's think about why we do mixed precision in HPC at all. There are, of course, many reasons.
A
Current
gpus
and
other
hardware
have
acceleration
specifically
for
low
precision,
and
but
there
are
other
reasons
as
well,
such
as
reducing
memory,
traffic,
reducing
network
traffic,
producing
memory
footprint
and
also
sometimes
you
want
to
change
the
precision
and
for
suitable
numerical
properties.
I
don't
just
mean
going
from
single
precision
to
double
precision,
but
also
using
things
in
integer
instead
of
a
floating
point,
for
example,
and
the
whole
goal
of
using
mixed
decision
is
really
to
accelerate
or
improve
the
algorithms
without
compromising
the
quality
of
science
to
accelerate
the
time
to
science.
So the case study I'll be talking about today is lattice quantum chromodynamics, lattice QCD. I'm not going to tell you much about what QCD is other than that it's the science of subnuclear physics: QCD is the theory of the strong force, which binds together the neutrons and the protons. It's a very simple theory to write down, a path integral, and it's a very beautiful theory. Unfortunately, it's a highly non-linear theory, and you can't solve it using Feynman diagrams.
That essentially means you can't use a Taylor expansion to make predictions, so you have to resort to numerical methods, and lattice quantum chromodynamics is the method of choice. Here you take infinite space-time and put it in a finite periodic box. So we end up with a four-dimensional grid with periodic boundary conditions, the original partial differential equations are turned into finite difference equations, which means stencils on a regular grid, and we have to solve problems of the form Ax = b.
So A is a large sparse matrix, and x and b are vectors. Lattice QCD is relevant in this conversation because it's a very large consumer of public supercomputing cycles: 10-plus percent of supercomputing cycles in North America, Europe and Japan, and increasingly so in China and India.
So it's traditionally been very highly optimized on every HPC platform that's been around for the last 30 years. For the US specifically, we can see why we care about this in the two pie charts on the left, which show the allocation and utilization on the Summit machine at Oak Ridge by lattice QCD. In INCITE 2019 approximately one sixth of all cycles were allocated to it, but because lattice QCD is always ready to absorb any cycles available, it actually took more like 35-40% of all cycles on Summit. NERSC is a bit more conservative in giving time to QCD physicists, more like 13% there, and you can see it's split between the applications Chroma, CPS and MILC. So it's an important application for HPC centers to optimize. There are two major steps in a lattice QCD calculation: a large strong-scaling Monte Carlo phase, and a second phase where you take all the configurations generated by that Monte Carlo Metropolis strong-scaled algorithm and analyze them.
Step one requires strong scaling, running on leadership facilities. Step two is a task-parallel job that can be done at a smaller, more easily parallelizable scale; it's much more task parallel. Both of these steps require solving this Ax = b problem in the top right here, where A is the discretized Dirac operator, or Dirac stencil. So, if you're not sure what lattice QCD is, here is the picture.
You run your simulations, you strong scale them, strong scaling the stencil problem across many GPUs or many processors, whichever your architecture of choice is, and you come up with predictions. You then compare them against experimental results, such as what comes out of the Large Hadron Collider in Geneva, and ideally you're looking for deviations between theory and experiment, because that's the sign of new physics.
So QUDA is the application I work on. QUDA is a library which accelerates lattice QCD on GPUs, it's written in modern C++ using CUDA C++, and it provides acceleration for all the major lattice QCD applications which absorb all that time on the leadership facilities. The ones I point out here are Chroma, MILC and CPS, and what we use a lot is mixed precision methods to optimize performance.
So that's what I'll talk about now, and it's very much a research tool for how to reach the exascale using algorithms and techniques. Here are some of my contributors to the library, my collaborators. The folks highlighted here are the ones who work at NVIDIA; we have quite a few former QCD physicists coming to NVIDIA.
So, mixed precision Krylov solvers are what we're really talking about today. We have this Dirac stencil and it's essentially a sparse matrix problem, and like any sparse matrix problem, or any stencil problem for that matter, it's very memory bandwidth bound. The arithmetic intensity is quite low, around one in single precision, so on any modern architecture you're going to be very memory bandwidth bound in what performance you can get. So what we have to do is maximize our arithmetic intensity, using whatever technique possible.
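As a rough way to see why this matters (an illustrative roofline estimate, not a slide from the talk): for a bandwidth-bound kernel the attainable flop rate is approximately the arithmetic intensity times the effective memory bandwidth,

    % roofline estimate for a bandwidth-bound kernel
    \mathrm{GFLOP/s} \;\approx\; I \times B_{\mathrm{eff}},
    \qquad I = \frac{\text{flops performed}}{\text{bytes moved}},

so halving the storage precision halves the bytes moved, doubles I, and, at fixed effective bandwidth, roughly doubles the attainable flop rate.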
The main technique I'm going to talk about here is using low precision: going from single precision to 16-bit precision, you immediately double your performance in terms of gigaflops, because you're memory bandwidth bound. So here's an example of some performance scaling just to demonstrate this, where we're increasing the size of the box.
A
Here,
l
was
just
the
number
of
grid
points,
we're
comparing
double
precision
in
single
precision
and
you
can
see
double
precision,
saturates,
700,
gigaflops,
single
precision
doubles
the
performance
and
that's
purely
because
of
the
memory
traffic
has
halved,
and
so
we
double
the
performance.
We
also
have
the
16
bit
fixed
point
format.
I
just
alluded
to.
We
don't
actually
use
fb16,
we
use
a
custom,
fixed
point
format
for
more
precision
because
we
want
dynamic
range
over
the
grid,
but
we
need
more
anticipates
on
each
grid
point.
So we use what some people call a block float format. Each of the grid points is actually a vector of 12 complex numbers, and we store that vector of 12 complex numbers using a 16-bit fixed-point format with a normalizer, an FP32 value, to set the scale. So over the whole grid we have the dynamic range of FP32, but locally we have 16 bits of mantissa. This gives us a combination of precision and range which helps with the stability of the numerical solvers.
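As an illustrative sketch of the idea (this is not QUDA's actual storage layout; the struct, scaling and rounding choices below are simplified placeholders):

    // Sketch of a "block float" site format: 12 complex numbers (24 reals)
    // stored as 16-bit fixed point plus one fp32 normalizer per site.
    #include <math.h>
    #include <stdint.h>

    struct BlockFloatSite {
      int16_t q[24];   // fixed-point mantissas, one per real component
      float   scale;   // per-site normalizer: fp32 dynamic range over the grid
    };

    __host__ __device__ inline void pack(const float *v, BlockFloatSite &s) {
      float amax = 0.f;
      for (int i = 0; i < 24; i++) amax = fmaxf(amax, fabsf(v[i]));
      s.scale = (amax > 0.f) ? amax : 1.f;
      for (int i = 0; i < 24; i++)
        s.q[i] = (int16_t)rintf(v[i] / s.scale * 32767.f);  // 16-bit mantissa
    }

    __host__ __device__ inline void unpack(const BlockFloatSite &s, float *v) {
      for (int i = 0; i < 24; i++) v[i] = s.q[i] * (s.scale / 32767.f);
    }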
The blue line here shows the comparison of half precision, the 16-bit fixed point, versus single and double, and you can see that each time you halve the precision of the representation you double the performance, or more or less double it; sometimes you actually increase performance by slightly more than a factor of two. That's shown on the right-hand side here with the effective memory bandwidth being achieved, which is more than the memory bandwidth of the device, in this case 900 gigabytes per second peak. That's because you have L2 reuse: there's lots of locality in the problem, and the L2 on the GPU is very good at capturing the locality of the stencil.
Now, it's all very well having a low precision stencil, but it doesn't help unless it improves your time to solution, and the time to solution must be measured against solving some high precision problem. So you want your actual solver, whether it be a CG solver, a BiCGStab solver, a multigrid solver, etc., to be good to double precision accuracy, or something close to double precision accuracy.
You have to reproduce the high precision result, and you have to be careful not to violate that; otherwise you do not end up with a free lunch.
So in the context of conjugate gradient, we found that there are four key ingredients to having a high precision solver do most of its work in low precision. Now, the traditional approach with mixed precision solvers is to use something called defect correction, or iterative refinement. You have an inner-outer solver, where you solve an inner system using low precision.
You finish that solve, say you finish the inner CG solve to 10^-4 or 10^-5 accuracy, then you calculate the outer residual in high precision, then you restart the system and do another inner solve in low precision, and you keep repeating that until you've converged.
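A minimal sketch of that defect-correction loop, assuming a generic low-precision inner solver and a high-precision operator (the names apply_A and solve_lowprec are placeholders, not QUDA's API):

    #include <vector>
    #include <cmath>

    using Vec = std::vector<double>;

    // Assumed to be provided elsewhere:
    Vec apply_A(const Vec &x);                      // high-precision mat-vec A*x
    Vec solve_lowprec(const Vec &rhs, double tol);  // inner solve of A*e = rhs in low precision

    static double norm(const Vec &v) {
      double s = 0.0;
      for (double x : v) s += x * x;
      return std::sqrt(s);
    }

    // Defect correction / iterative refinement: inner solves in low precision,
    // true residual and solution accumulation kept in double precision.
    Vec defect_correction(const Vec &b, double tol, double inner_tol) {
      Vec x(b.size(), 0.0);
      Vec r = b;                                 // r = b - A*x with x = 0
      const double b_norm = norm(b);
      while (norm(r) / b_norm > tol) {
        Vec e = solve_lowprec(r, inner_tol);     // e ~ A^{-1} r, e.g. to 1e-4
        for (size_t i = 0; i < x.size(); i++) x[i] += e[i];          // accumulate
        Vec Ax = apply_A(x);
        for (size_t i = 0; i < r.size(); i++) r[i] = b[i] - Ax[i];   // restart
      }
      return x;
    }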
Unfortunately, when you do this, you throw away the Krylov space that the solver has generated, and you end up doing many more iterations to reach convergence. So we use something different, called reliable updates, which is an old technique developed in the 90s to improve the numerical stability of BiCGStab, but it allows us to keep the Krylov space resident and do the bulk of the work in low precision while maintaining high precision accuracy. I refer you to the paper here for more details.
Even though you're doing most of the work in half precision, you need to make sure the accumulation of the solution vector is always done in double precision. Another thing we inherited from non-linear CG was to use a different form for calculating the beta coefficient, which is identical in infinite precision but is more numerically stable, self-stabilizing in fact, and improves the orthogonality of back-to-back iterations in conjugate gradient.
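For reference, the two standard, algebraically equivalent forms of the CG beta coefficient (this is my reading of the trick being referred to; the slide itself is not reproduced here) are the Fletcher-Reeves and Polak-Ribière expressions,

    \beta_k^{\mathrm{FR}} = \frac{r_{k+1}^\dagger r_{k+1}}{r_k^\dagger r_k},
    \qquad
    \beta_k^{\mathrm{PR}} = \frac{r_{k+1}^\dagger \,(r_{k+1} - r_k)}{r_k^\dagger r_k},

which coincide in exact arithmetic because successive residuals are orthogonal; in finite precision the second form damps any loss of orthogonality, which is what would make it self-stabilizing.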
What we're looking at here is a comparison of, in green, a pure double precision conjugate gradient solve, doing some number of iterations and trying to converge to 10^-7 accuracy. If you just plug in naive low precision conjugate gradient, without the recipe I just described, then you end up with the spiky black behavior. If you use the recipe described, however, you get this nice smooth red line. So you pay a small overhead in iteration count, but you can still converge to your target and get the free lunch you desire.
You can see here the condition number of the system is 10^4. As we make the condition number of the system higher, increasing it by two orders of magnitude, what you see is that the original naive conjugate gradient solver didn't converge at all: you just get this mess of convergence, the black line here, and in fact it never converged, you just get this oscillating behavior. The red line is the stabilized solver, and you'll see that, although it's not as stable as the green, pure double precision result, it still converges.
If you plug that into the actual solver, you can see the actual speedup you get from this more careful use of precision: double precision, going down to double-single, going down to double-half, where double-half gets the best overall time to solution in this MILC application. Using mixed precision here was important; these MILC calculations can run for months at a time, so this can knock weeks off a calculation.
Having gone down to 16-bit precision, it's interesting to think about how low we can go. Can we go all the way down to 8-bit and still converge that conjugate gradient solver? In the plot on the right here I'm showing the performance of double, single, half and now quarter precision. You can see quarter improved performance again on a Volta GPU, though not quite by a factor of two now, because other things are coming in, like indexing overheads, load/store instruction overheads, and things like that.
You can still see we get a nice boost in gigaflops, but when we actually look at the solver and whether it converges or not, what we see is the following. The regime of physical interest is shown in the pink area of this plot; on the x-axis I'm showing the condition number of the linear system, and this is the area we care about in terms of physics. If we look at the iteration count on a logarithmic scale for double-double, double-single, double-half and double-quarter, the first three of those all lie approximately on top of each other.
So they all have approximately similar iteration counts. Eventually double-half starts to diverge, but in the regime of physical interest we basically keep the iteration count fixed, whereas with double-quarter, where we're trying to converge to double precision accuracy while doing most of the work in quarter precision, we just do not get there. Even for very easy-to-solve problems, you cannot converge without a huge overhead.
So 8-bit is a bridge too far; 16-bit is where it's at, it would seem. Now let's switch to multigrid. Multigrid is the optimal way to solve these types of stencil problems, and that's why we care about it. It's much faster, much more numerically stable, and it's optimal in the sense that its convergence is flat with condition number.
So as you change the condition number of your system, you have no increase in the number of iterations of your solver, and it also has order-N scaling, which means that as you double your problem size your iteration count stays fixed as well. So this is the solver we really care about.
Now, multigrid in some ways is the ideal algorithm for mixed precision. Each pass of multigrid, each V-cycle if you're familiar with multigrid, only reduces the residual by an order of magnitude; it's very much an approximate solver. Very often you would wrap it as a preconditioner for an outer Krylov solver, or you do multiple passes of multigrid steps to reach your target accuracy. So each iteration of multigrid doesn't have to be that accurate, and that allows you great freedom to reduce precision dramatically inside the multigrid solver.
So the method we were looking at was to do the entire multigrid V-cycle in 16-bit precision and then wrap that in an outer GCR solver. We keep that outer GCR solver in single precision, and again there is an outer layer around that in double precision.
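As an illustrative picture of the nesting just described (the types and names below are placeholders, not QUDA's actual interface):

    // Precision layering described in the talk, as a declarative sketch:
    //   double : outermost layer (true residuals, solution accumulation)
    //   single : outer GCR Krylov solver
    //   half   : the entire multigrid V-cycle used as the GCR preconditioner
    enum class Prec { Double, Single, Half };

    struct MixedPrecisionMG {
      Prec outer_prec   = Prec::Double;  // outer residual / accumulation layer
      Prec krylov_prec  = Prec::Single;  // GCR iterations
      Prec precond_prec = Prec::Half;    // V-cycle: smoothers and coarse grids
    };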
The slide here is showing one thing you do have to be careful about with mixed precision, which is latency. The latency stays fixed, of course, as you go from double to single to half, so as your performance gets faster that latency becomes a bigger and bigger overhead. And in multigrid, as your solver coarsens you go to smaller and smaller grids, say from ten to eight to six to four to two, and eventually you've got a very small problem size.
Another interesting thing we found was in the multigrid setup. [Session chair: The time of the talk is reaching its end; could you wind up in one minute?] Yep, I'll be quick. So when we're doing coarsening in multigrid, we would like to do it in parallel, using massive parallelism, and that's difficult.
A
If
you
have
to
add
lots
of
numbers
together,
and
so
instead
we
will
use
is
integer
atomics,
because
there
we
have
determinism
and
associativity,
and
it
means
we
always
get
the
right
answer,
and
so
we
always
get
the
same
answer
all
the
time,
which
is
very
beneficial.
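As a rough illustration of that trick (not code from QUDA; the scaling and names are made up), converting each contribution to fixed point and accumulating with integer atomics gives an order-independent, reproducible sum:

    // Deterministic reduction sketch: scale each float to fixed point and
    // atomicAdd the integers; integer addition is associative, so the result
    // does not depend on the order in which threads contribute.
    #include <stdint.h>

    __global__ void reduce_fixed_point(const float *in, int n,
                                       unsigned long long *acc, float scale) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        long long q = llrintf(in[i] * scale);            // signed fixed point
        atomicAdd(acc, (unsigned long long)q);           // wraps consistently
      }
    }
    // host side: sum = (double)(long long)(*acc) / scale;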
When you put all these things together, this optimized multigrid solver with four different precisions, double, single, half and int32 for the different parts of the solver, you see a significant speedup versus the solver we had before. Going from Titan to Summit and changing the algorithm, we had an 82x speedup altogether. I won't talk about this plot here, other than to say that 8-bit is actually stable in multigrid, unlike the case of CG, so we can use 8-bit all the way down.
You can use multigrid with 8-bit precision. Now, just wrapping up here, I wanted to highlight some of the work we're doing with tensor cores. Tensor cores are high-throughput matrix multiplication instructions on the Volta and Ampere architectures, with extremely high throughput for low precision such as FP16, and we've been deploying these instructions to accelerate some calculations on the Summit architecture.
A
What
we
see
is
that
preconditioners,
which
are
designed
to
reduce
network
communication,
which
are
too
expensive
without
tensor
cores,
are
made
possible
now
with
tensor
cores
and
as
a
result,
you
know
at
large
scales.
We
can
use
tensor
cores
to
accelerate
these
large
strong
scaling
workloads.
That would not be possible otherwise, and we turn a problem which was completely network bound into one that is compute bound. Finally, I just want to highlight the fact that we're now looking at using tensor cores throughout different aspects of the lattice QCD workflow. What we recognized, after the fact, is that many of these workloads are actually matrix multiplication, and so they are directly amenable to being reformulated onto tensor cores. So there's ongoing work to turn these matrix multiplication kernels into explicit tensor core usage, and we see significant speedups when doing that. What we expect when this work is done is that we'll have 2x the throughput of Chroma running on Summit, for example, and as noted at the beginning, that's a significant fraction of what runs on Summit now. So we can achieve a 2x speedup by using the tensor cores that are there, which is great.
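For readers unfamiliar with what explicit tensor core usage looks like at the kernel level, here is a minimal, generic CUDA WMMA example (one warp multiplying a single 16x16x16 FP16 tile with FP32 accumulation, requiring sm_70 or newer); it is a textbook illustration only, not code from QUDA or Chroma:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void wmma_tile_16x16x16(const half *A, const half *B, float *C) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

      wmma::fill_fragment(c, 0.0f);
      wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
      wmma::load_matrix_sync(b, B, 16);
      wmma::mma_sync(c, a, b, c);         // c += a * b on the tensor cores
      wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
    }
    // launch with a single warp, e.g. wmma_tile_16x16x16<<<1, 32>>>(dA, dB, dC);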
Just to conclude with my final slide before I'm kicked off: I would advise, generally, don't use double precision without considering what precision is actually required in the calculation. Judicious use of precision tuning can lead to large speedups, more than 4x is what we've seen here, and creativity is required, such as using integers where necessary instead of floating point, or using low precision as a storage format as opposed to just for computation. Some algorithms can utilize many more than two precisions; three or four precisions is sometimes the optimal approach to take, depending on the calculation.
The future work we're doing here is to recast a lot of these algorithms to make use of the tensor core functionality that is on current and future GPUs, to really give a scalable solution for the future. Okay, and with that, I thank you.