From YouTube: Lattice QCD (LQCD) Project
Description
Steven Gottlieb (Indiana University)
Lattice QCD (LQCD) Project
Thank you. I hope you can see my slides. I'd like to say thanks for the invitation to speak, particularly to Neil and Rahul. I'm going to talk about lattice QCD, which involves more than one group. First I'll talk a little bit about lattice QCD, some of our accomplishments on Perlmutter, and then a few benchmarks.
Ah — sorry, that is not good. Let's see, we can just re-share; I will try to do that. Oh, I think I know what I did wrong. Yeah, we can see them now, thanks. Right, I forgot to click on share. All right.
Sorry. So, quantum chromodynamics is a 50-year-old quantum field theory of the strong interaction. I happen to know it's 50 years old because I'm involved in preparing a volume coming out called "50 Years of QCD". It describes quarks, which are particles of matter, and gluons, which are the force carriers. These are the analogs in QED of the electrons being matter and photons being the force carrier.
Quarks have this quantum number that we call color, sometimes called red, green, and blue, which has nothing to do with regular colors. It's responsible for making bound states of quarks and antiquarks, which are called mesons, and baryons, which are three quarks bound together, and these have no net color. The nuclear force is due to a residual color force between protons and neutrons, which, as I said, do not have color themselves. And people often talk about the Higgs field as being the origin of mass.
Well, our mass actually comes from QCD, not the Higgs field; you should be aware of that when they make those claims. So Ken Wilson developed lattice QCD to go beyond perturbation theory, which doesn't work so well at low energies in QCD, but at high energies it works well, and my thesis advisor got the Nobel Prize for that.
What we do for lattice QCD is that the continuum of space-time is replaced by a four-dimensional grid of discrete points, and then the quarks are described by complex fields which have either three color components — which is the kind I use, called staggered — or three color components and four spin components, which is the original Wilson formulation. There are also some other formulations.
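As a rough illustration of what such a field looks like in memory — a minimal sketch with made-up names, not the MILC or QUDA data layout — a staggered quark field stores one three-component complex color vector per site of the four-dimensional grid:

    #include <array>
    #include <complex>
    #include <vector>

    // Minimal sketch of a staggered quark field: one 3-component complex
    // color vector per site of an Nx*Ny*Nz*Nt grid (names are illustrative).
    using ColorVector = std::array<std::complex<double>, 3>;

    struct StaggeredField {
        int nx, ny, nz, nt;
        std::vector<ColorVector> site;   // one entry per lattice site

        StaggeredField(int x, int y, int z, int t)
            : nx(x), ny(y), nz(z), nt(t),
              site(static_cast<size_t>(x) * y * z * t) {}

        // Lexicographic index of the site (x, y, z, t).
        ColorVector& at(int x, int y, int z, int t) {
            return site[((static_cast<size_t>(t) * nz + z) * ny + y) * nx + x];
        }
    };

A Wilson-type field would carry four such color vectors (one per spin component) at each site instead of one.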
So the basic calculation we do is like a Feynman path integral, but we have to change the theory to imaginary time, and that makes it a lot like a statistical-mechanical partition function. The numerical methods include Monte Carlo — lots of random numbers. Most of the time goes into sparse matrix solvers, and we actually have something like molecular dynamics in our simulation-time evolution.
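Written out, the expectation values we compute have the form of a Euclidean (imaginary-time) path integral, i.e. a statistical-mechanical average over gauge fields U:

    \langle O \rangle = \frac{1}{Z} \int \mathcal{D}U \, O[U] \, e^{-S[U]}, \qquad Z = \int \mathcal{D}U \, e^{-S[U]},

and the Monte Carlo sampling generates gauge fields with probability weight proportional to e^{-S[U]}.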
So we're constantly updating things according to this small step size. There are two things we need to do. The first is to calculate ensembles of these gauge fields, and these are basically pictures of the QCD vacuum; they have to be properly weighted paths in the path integral — that's basically what they are. Then, to do a physics calculation, we have to take averages over these gauge fields in an ensemble, and naturally larger ensembles give better statistical averages and help us average over the quantum fluctuations.
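Concretely, given N properly weighted configurations U_1, ..., U_N, a measurement is a sample average whose statistical error falls like one over the square root of N (up to autocorrelations between successive configurations):

    \langle O \rangle \approx \frac{1}{N} \sum_{i=1}^{N} O[U_i], \qquad \delta O \sim \frac{\sigma_O}{\sqrt{N}}.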
So to carry out a physics measurement you have to control systematic errors, and there are several of them. First, to generate an ensemble you have to make some choices. You have to decide on a lattice spacing — the smaller the better — and actually you don't set the lattice spacing directly: you set the strength of the gauge coupling, and you determine the lattice spacing later. You have to put the system in a finite-size box.
So I call that N_spatial cubed by N_time, and then we generally use periodic boundary conditions in space and anti-periodic for the quarks in time. Then there are the quarks, and the ones we put in the calculation are the up and down quarks, which are the lightest ones, the strange quark, and the charm quark.
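Spelled out in the standard conventions, those boundary conditions for a quark field psi on an N_s^3 x N_t box read:

    \psi(x + N_s a\,\hat{\imath}) = +\,\psi(x) \quad (i = x, y, z, \text{ periodic in space}), \qquad \psi(x + N_t a\,\hat{t}\,) = -\,\psi(x) \quad (\text{anti-periodic in time}).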
We don't generally put in the bottom quark or the top quark, because they're so heavy compared to the QCD scale, and we usually set the mass of the up and down quarks to be the same. You have to tune those properly or you don't get the proper masses in the theory. And — well, actually, I've just gotten into the next bullet — to control the errors you have to make the lattice spacing smaller and smaller, a going to zero.
You either take an infinite-volume limit or just use a big enough box where you don't think the effects are significant, and I already said you have to tune the quark masses to get them right. So why do we use so much computer time? Well, it's because controlling each of these systematic errors involves investing more time. If I halve the lattice spacing, it's going to increase the time of the calculation by about a factor of 2 to the sixth — that's at a fixed physical volume.
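A rough accounting of that factor — an estimate, not an exact formula: at fixed physical volume the number of sites grows as (L/a)^4 as the spacing shrinks, and the finer molecular-dynamics step size and more ill-conditioned solver contribute roughly two more powers of 1/a, so

    \text{cost} \sim \left(\frac{L}{a}\right)^{4} \left(\frac{1}{a}\right)^{\sim 2} \quad\Rightarrow\quad a \to a/2 \ \text{costs roughly } 2^{6} = 64 \ \text{times as much}.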
If I want to increase the physical volume — say, double the linear size — then you get a factor of 2 to the fourth, because it's x, y, z, and t, or only 2 cubed if you're only making space bigger. And then there's tuning the up and down quark masses to their physical values.
That's not a direct factor, but for many years it was too expensive, and now we can do it, so we have ensembles with very closely tuned physical quark masses. And then, when we create these costly ensembles through the stochastic evolution to get these snapshots of the vacuum, the iterative solver takes much of the time, and you'd like to do this in a few stochastic evolutions, so you'd like this to run quickly — it's more of a strong-scaling problem.
At any rate, once the ensemble is generated, you store the configurations on disk and tape, and you can run several measurement jobs, as we call them, in parallel. So creating the gauge fields, you want high speed — it's more like a capability problem; doing the physics analysis is more like a capacity problem, but it's capacity at still a high rate of speed. So I'd like to talk a little bit about some of what we've been able to accomplish on Perlmutter. I was a little bit late in asking the people who are involved in this NESAP project
to give me some results, and a lot of this has to do with what my colleague Carleton DeTar and I have done — we're in the Fermilab Lattice and MILC collaborations. Chris Kelly did send me some information, and I have some of that here, and I got another slide today which I should have time to show. So DeTar and I are interested in the decay of mesons that contain a bottom quark. It's a challenging calculation, and we need to have a fairly small lattice spacing for that. And the reason we're interested in this:
It helps determine a fundamental parameter of the standard model — the elements of the CKM mixing matrix, C-K-M for Cabibbo, Kobayashi, and Maskawa. Kobayashi and Maskawa won the Nobel Prize for realizing that a three-by-three matrix could explain CP violation in the universe, but they didn't say what the values are. We have to do our calculations, combined with experiment, to actually figure out the values. And a key issue is whether there's evidence for new physics, because the matrix has to be unitary in the standard model.
But if it's not — and we could discover that if we tightly constrain the elements — that would be evidence for new physics. So I created a bunch of new gauge configurations on basically the toughest lattice we have, or the toughest ensemble for which we've generated configurations. This actually started in 2014, generating configurations, and what I have here is a folded timeline of what happened. You can see in 2014 we had about 400 time units, and our goal was 6,000 time units of running.
We save a configuration every six time units, so we wanted a thousand configurations. You can see we went along and along for about four years, until mid-2018 — this was done on several computers that I'll talk about later — and then there was a gap of about three years until Perlmutter came along. You can see the red plot for Perlmutter — wow, what a change in the slope. This is the power of Perlmutter for us, but I have another comment about the power of Perlmutter.
That graph was prepared in December 2021. Well, it turns out Perlmutter became much busier once it wasn't so much early science and everyone was allowed on Perlmutter. So you can see that our rate of progress slowed down considerably: you need both a fast computer to do lattice QCD and a sufficient allocation to use it. We completed our goal of creating 500 new configurations in the first quarter of the year, and DeTar has been able to analyze about 50 of them. So, a few comments.
The Jefferson Lab group uses QDP-JIT, QUDA, and Chroma for their work. They did a new multigrid solver and had some significant algorithmic improvements, and a job that took 1,192 seconds on 256 Edison nodes they were able to run in just 80 seconds on 32 Perlmutter nodes. It's a combination of the hardware being much faster and the algorithm being improved.
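In node-seconds, the two runs just quoted differ by roughly a factor of 120:

    \frac{256 \times 1192\ \mathrm{s}}{32 \times 80\ \mathrm{s}} = \frac{305152}{2560} \approx 119.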
The RBC/UKQCD group has been doing two projects on Perlmutter. One involves the muon anomalous magnetic moment, which is an experiment that was done at Brookhaven initially about 20 years ago, and a little bit over a year ago a new result was announced at Fermilab. This is one of the best — or had been one of the most intriguing — pieces of evidence for physics beyond the standard model, so it's still a very important calculation.
At any rate, their first project was using 256 nodes to analyze a 96 cubed by 192 grid, and for the second project they're running on 32 nodes, and they've looked at two different size grids with domain wall fermions, which involve a fifth dimension. One of the interesting things I found out: a 5.9-hour job on Slingshot 10 was reduced to 4.7 hours after the upgrade to Slingshot 11.
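That is roughly a 20 percent reduction in wall-clock time from the interconnect upgrade alone:

    \frac{5.9 - 4.7}{5.9} \approx 0.20.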
So here I have a cross-platform comparison for the lattice generation that I was talking about before. Those four years of running were a combination of Edison, Cori, and Blue Waters, and the red line, of course, is Perlmutter. Here you can see that what took from five and a half to eight, almost nine, hours is reduced to 1.53 hours on Perlmutter, which is very nice. And, you know, this doesn't take into account how many nodes there are, so you could multiply out node-hours.
The question mark on Blue Waters is because I can't remember what we did about hyper-threading there — so sorry about that; you'll get at least within a factor of two on node-hours.
So I wanted to say something about our performance on Perlmutter. For the production running I used, as I mentioned, 128 nodes. I'm pretty sure I could have run it on 64 and had higher efficiency; I think 128 was dictated by how much I wanted to get done within the maximum wall time we were allowed.
It's a four-dimensional grid, and I cut the grid up in only three dimensions, which helps reduce communications. So the 144 — you know, the X direction — was not cut, but Y, Z, and T were all cut. I was getting 285 gigaflops per GPU in single precision — that's actually in a mixed-precision solver, so part of it is half precision. For link smearing I was getting 150 gigaflops per GPU, and for the gauge force I was getting 1.5 to 1.7 teraflops.
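To illustrate why leaving one direction uncut reduces communication — a toy sketch with made-up sizes and a made-up processor grid, not the actual MILC decomposition — each GPU's halo traffic is proportional to the surface of its local sub-volume, and an uncut direction contributes no faces:

    #include <cstdio>

    // Toy model of a 4D domain decomposition: global dims split across a
    // processor grid (px, py, pz, pt). Directions with p = 1 need no halo
    // exchange. All numbers here are illustrative, not the real geometry.
    int main() {
        const int G[4] = {144, 144, 144, 288};   // hypothetical global lattice
        const int P[4] = {1, 4, 8, 16};          // hypothetical 512-GPU grid, X uncut
        long volume = 1, surface = 0;
        int L[4];
        for (int d = 0; d < 4; ++d) {
            L[d] = G[d] / P[d];                  // local extent in direction d
            volume *= L[d];
        }
        for (int d = 0; d < 4; ++d) {
            if (P[d] == 1) continue;             // uncut direction: nothing to send
            long face = volume / L[d];           // sites on one face
            surface += 2 * face;                 // forward and backward faces
        }
        std::printf("local volume %ld sites, halo surface %ld sites (ratio %.3f)\n",
                    volume, surface, (double)surface / volume);
        return 0;
    }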
That gauge force code avoids communication — I noticed about a decade ago that it was really slower than it needed to be — and the link smearing and fermion force, which I don't have any results on here, probably could benefit from that. So here is a finite-volume study of just the conjugate gradient solver. I have different traces for different numbers of GPUs — the labels are all numbers of GPUs — and L means the local volume on each GPU was L to the fourth grid points.
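For readers who have not seen it, the conjugate gradient iteration being timed here is quite short; this is a generic sketch driven by a user-supplied matrix-vector product standing in for the Dirac operator, not the actual MILC or QUDA solver:

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    // Generic conjugate gradient for A x = b with A symmetric positive definite.
    // 'apply_A' stands in for the sparse Dirac-operator product that dominates
    // lattice QCD run time; illustrative sketch only, not production code.
    using Vec = std::vector<double>;

    double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    void cg(const std::function<void(const Vec&, Vec&)>& apply_A,
            const Vec& b, Vec& x, double tol, int max_iter) {
        Vec r = b, p = b, Ap(b.size());
        std::fill(x.begin(), x.end(), 0.0);       // start from x = 0, so r = b
        double rr = dot(r, r);
        for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
            apply_A(p, Ap);                       // matrix-vector product (the hot spot)
            double alpha = rr / dot(p, Ap);
            for (size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rr_new = dot(r, r);
            for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + (rr_new / rr) * p[i];
            rr = rr_new;
        }
    }

Almost all of the time goes into apply_A, which is why the solver's scaling tracks the local volume per GPU so closely.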
This was an early study — I could do a better job now, but I noticed when I tried to redo this that Perlmutter was very busy and I couldn't get my new benchmarks done. In this case, when you go to 16 GPUs and beyond, we're cutting in all four dimensions, as I mentioned; on bigger problems we don't really have to do that.
So you see, within a node — one, two, and four GPUs — the performance is quite high, and it mainly depends upon L. When L is, say, 32 or larger, you'd be proud of the performance. You probably don't want to do production running below 20 percent, if someone asks how efficiently you're using the computer. But you can also see that things scale pretty well beyond 16 GPUs, when we're no longer increasing the amount of communication. This is a similar plot for the gauge force, and in view of time,
I think I'll not say so much about that. So I was asked to say something about software development. The lattice QCD community has been creating community software and sharing it for a long time.
The QUDA project — one of the codes that we use very heavily to make use of NVIDIA GPUs — began in 2008 at Boston University. Two of the main developers from BU, Kate Clark and Ron Babich, now work for NVIDIA; one of my former postdocs, Mathias Wagner, and another former postdoc at BU, Evan Weinberg, also work for NVIDIA. Kate, Mathias, and Evan still spend a lot of their time on QUDA, but not all of it.
My work to support staggered quarks, which weren't part of the original QUDA project, was done with Guochun Shi when I was on sabbatical at NCSA for the Blue Waters project. It turns out that project changed a lot, and starting the QUDA work was the best thing that I got out of it. So QUDA originally only supported NVIDIA, but it's been generalized with back ends to support HIP, SYCL, and OpenMP. I've run the HIP version on Crusher at Oak Ridge. So our community really benefits greatly from QUDA, and I'm not sure other areas of science have a similar thing going.
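To give a flavor of what a "back end" abstracts — purely illustrative, not QUDA's actual interface — the same kernel source can often compile unchanged for NVIDIA with nvcc or for AMD with hipcc (after including <hip/hip_runtime.h>), because HIP mirrors the CUDA kernel syntax:

    // Illustrative only (not QUDA code): a single axpy kernel, y[i] += a * x[i],
    // written once in CUDA/HIP-compatible syntax. A multi-backend library
    // expresses operations like this once and maps them onto CUDA, HIP, SYCL,
    // or OpenMP offload.
    __global__ void axpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one array site per thread
        if (i < n) y[i] += a * x[i];
    }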
Speaking of NVIDIA, I got this slide this morning from Jiqun Tu, who, I think, went straight from graduate school to NVIDIA, and he works mainly on domain wall fermions, which have a fifth dimension. So this is Möbius domain wall fermions, 64 cubed by 96, with a fifth dimension of 12.
Blue is before the upgrade from Slingshot 10 to Slingshot 11, and red is after. He told me in his email, and said here, that he was able to get a 64-GPU run done which was 30 percent faster than what's shown here, but he didn't have a chance to get the other runs done — I think he just couldn't get onto Perlmutter because it's in maintenance. So that's that. I'm getting a little bit short on time. So, performance portability: we've heard a bunch of talks today about different approaches.
To me, it hasn't been clear for quite a while what the best performance-portability approach is. I mention some here: Kokkos — we heard about this — and we heard about OpenMP, DPC++, and HIP. Last week I learned that HIP is going to be supported on NVIDIA, and probably, you know, on Perlmutter, and also on Aurora, which is very interesting. There's been a series of meetings from the DOE on performance portability.
I encourage you to use Google and find some of those. On the right is the cover of a special issue of Computing in Science & Engineering on performance portability for advanced architectures that I co-edited; you might find that interesting to look at. And now I've gotten to my conclusions. Perlmutter is a very powerful platform for scientific computing.
If you've been using Cori GPU, the transition should be easy. If you're just starting out with GPUs, I would suggest spending some time studying the different approaches to GPUs before you commit to a porting strategy. And finally, have fun — but please leave some time for us, i.e., the lattice QCD physicists. Thank you very much for your attention.