National Energy Research Scientific Computing Center (NERSC) GPUs for Science Day, October 25, 2022, 9 Nov 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Massively Parallel MD Simulations using NAMD

Description

David Hardy (NAMD)
Massively Parallel MD Simulations using NAMD

A

Okay, so I'll talk about uh some of our work uh that uh has been run on chromeutter and um also some of the development efforts uh that have gone into namdi to uh utilize stems GPU nodes effectively.

A

uh So the MD is a scalable molecular Dynamics program, it's uh getting close to 30 years old now, um and uh the emphasis in the MD has traditionally been in parallel scaling of large systems, but uh I'll talk about the work today that has brought this back to um running very fast simulations on uh single gpus or a single node uh gpudence architectures.

A

And so the the first project I'd like to highlight uh was a Gordon Bell finalist in uh 2021, um with a hydatristian, uh definite gorgon and uh the pi on this was uh arvind ramanathan from uh Argonne and uh This was um an interesting study uh about the replication transcription complex um that is responsible for uh replicating and transcribing the um the viral uh mRNA in uh in inside uh a human cell, and so this involved um multiple levels of modeling where uh we started with cryo em data, and then um we used uh an intermediate Continuum model uh called a fluctuating finite element.

A

Analysis to uh try and resolve these. um uh These, these uh rather low resolution, cryo-em images into something that we could then uh run with all atom, molecular Dynamics, and uh so in order to drive this and make it even faster.

A

um We used uh an AI steering, workflow uh and uh so I'm just going to take you through the slides or the the transitions here, real fast um and, uh and uh the idea was that we could use this to um uh more quickly, um resolve images from the ffea side and uh then use these to actually feed back the entire uh uh modeling process, and uh so on the uh all atoms side. We were using uh our GPU resonant version of the MD that you know we're released this amd3.

A

And um what was interesting about this is that this was actually a workflow that was working uh to join uh multiple compute sites together, um including Pro Mudder, with uh the Theta and Theta GPU, and also a special AI test bed uh at Argonne, okay, and um so the idea was to have um this asynchronous workflow. That then, was was driven by uh AI to steer the the choice of simulations to to sample the the more interesting and underrepresented uh areas of uh the the entire uh conformational space.

A

And so um when we did this live, we were actually using uh compute nodes. The promoter, together with compute nodes of theta GPU, and they both have similar architectures uh Pearl, monitor, has uh four uh Envy linked a100 gpus per node, a Theta GPU is actually assembled out of dgx a100s, and so those are eight a100 gpus per node and uh connected by an Envy switch.

A

And uh so the other interesting thing I'd like to call out is um the um scaling uh performance here that we got out of uh the gpus uh versus uh what we could get out of a traditional cpu-based system.

A

So here this, this plot is actually showing the crossover point in performance uh between uh for for a 1.1 million atom system, and um we so I've indicated a horizontal line here where uh you know that shows the performance on attack Frontera at 128 nodes, and we see that three gpus is giving equivalent performance to um 128 nodes of of a very fast cpu-based system. So so that's significant.

A

There's a a lot of uh computational Power in in these uh dense GPU uh supercomputers that you know we you know, are continuing to develop our uh code to to unlock.

A

So a second project, that's been running on promoter is uh uh being done being pursued by Aaron, Chan and manly Cedar, um and uh so this is studying um what they call here. The most abundant photosynthetic organism on Earth, uh the uh prochlorococcus, um and uh so here uh the the idea is to model the rate limiting steps in uh this, this uh energy uh producing system and uh and in order to achieve the time scales that they need to.

A

On this um they, you know the the my understanding is that the setup uh of the the all atom data here was done with namdi, but then they switched over to a course: grained uh Martini force field representation and so uh they're they're using chromex, which you know implements this. This really well and um uh they're uh running, uh just a single copy of this uh per GPU uh and and so they've got an ensemble of of these systems that they're they're running together in parallel.

A

um But you know this course querying representation which reduces the system size to just 1.5 million um particles.

A

um They can run this at uh 40 nanoseconds uh per day, um and uh you know for for aggregate. uh uh You know uh time scales of uh 25 microseconds, so so this is, you know a lot of sampling that they could do by switching over to this. This representation.

A

So um now I'd like to go into uh talk about the um development work that we've done uh to make namdi really fast on gpus and, of course, I'll start by just talking introducing uh molecular Dynamics and that we're integrating Newton's equations of motion. So uh we have to do these steps sequentially, um but uh we have a lot of calculation that we need to do to calculate the forces of each stop, especially the non-bonded forces.

A

And so, if we look at you know uh the the parallel distributed uh the distributed parallel workflow of the MD and uh here take a look at how gpus are introduced.

A

um Well, what classic Miami has done has been to offload the force calculation to the GPU, where we have to aggregate enough position data together to to make these GPU kernel launches, uh meaningful um and uh and uh We've.

A

We've been incorporating gpus since, uh since Cuda was first uh released, so back in 2007., So, initially gpus weren't, nearly as capable as they are today, and we started with just uh calculating the non-bonded uh uh work on on the GPU and eventually we also ended up calculating the bonded computes and uh also the scalable parts of the pme calculation, the charge spreading and the force interpolation kernels.

A

For that, um and um in addition, uh we've got, uh you know the mechanism and the AMD so that, if you're only running on a single GPU or a single node that we can calculate uh the entire, uh you know uh work for pme on on a single GPU.

A

But you know in when we're looking at the parallel calculation. um You know the the original idea was to use the GPU offload scheme and we're partitioning work between the CPU and the GPU. Where you know the force, calculations are uh being done on the GPU and then the um the um uh remaining parts on the CPU include the integrator, rigid, Bond constraints and possibly whatever uh you know, enhanced sampling methods.

A

You know that you might be using, and this approach uh worked really well up until somewhere around the release of uh the Pascal generation and the voltage generation gpus. When we found that namdi's GPU offload approach was being increasingly CPU bound, um the the CPU, the work remaining on the CPU was becoming a bottleneck now schematically.

A

The idea here is that the GPU offload kernel, um you're, launching a force kernel here to calculate the forces, and then we had a mechanism in namdi where we could write back real, fast to host, pin memory, and meanwhile the CPU uh is uh they're busy waiting for to see if, if forces are done and as soon as forces are done, it can start into then integrating the next time step with those forces.

A

So that's how we were getting overlap between the CPU and the GPU and uh as gpus became uh faster and more capable um again we're. We were seeing significant gaps in how we, how much utilization of of the GPU the effective utilization of the gdpu that we were getting, and so um uh you know, then our approach was to develop a GPU resident version of PMD.

A

Where now the the atomic coordinate data all lives on the GPU between time, steps and you're basically have moved all of this uh atom integration, work and all the related stuff onto the GPU, and you end up having very little uh CPU uh work being done at this point.

A

So uh we can see here now in in the profiles after the GPU resident version that we're using uh the gpus quite effectively and uh our timings bore this out, where it was in some cases more than doubling the performance of uh the GPU offload version.

A

And so then our next step was to extend this to GPU, dense architectures. So you know, think of you know us leveraging dgx like architectures, like you find as the nodes of pearl mutter or uh you.

A

You also have that to some extent Summit, and then you know it also in in the uh Frontier uh computer, and um so here the idea was that you know we're now decomposing the entire problem between uh you know several gpus and um and uh but you know, there's going to be some communication required between the gpus and so um what we have to do is then, uh within each time step is we're going to have to do a position multicast to populate these compute objects that are then being calculated on on their respective gpus and then Force reductions back to you know the gpus that are holding the patches, and so this means the gpus need load, store memory access uh between these different devices within every time, step.

A

Okay, so that's why we need um a system with, with these fast interview, link connections between the gpus um and um since the original version of this we've actually done some work to uh improve the performance and scaling um where we've done some things to mitigate uh some of the work. That's still left on the CPU that includes this uh so-called at a migration step. That's updating the domain decomposition.

A

Also, we find that pme, uh you know, can cause quite a scaling bottleneck as well, and so we have some some things in place to mitigate uh that that those issues from pme and um so here's a performance plot now um with a a one million atom Benchmark system that we, you know uh regularly use to. uh You, know benchmark performance of the MD and, um and here we're showing that you know the versus the original version were uh scaling.

A

Quite a bit better, um for you know, say like a full dgx a100 uh again uh promoter is effectively. You know, half the a note of promoters, effectively: half the the computational power of of uh dgx a100, because there's there's just the four nodes and- and we were already scaling quite well out to four nodes, um but now we've these these optimizations help to improve performance as well and um so just to take a look at a larger size system.

A

So this was a um a uh SARS cov2 coronavirus, Spike protein system here of eight and a half million atoms, and uh we see that we're uh doing quite well and and uh the the new code is uh faster than you know, running this on on 64 nodes of a frontier out here in this. For this particular simulation.

A

um So some of the other work we're doing in Namby is uh to support the upcoming exascale supercomputers at uh Oak, Ridge and Argonne, uh so for Oak, Ridge um that these are um AMD, gpus and uh AMD uh went the route of developing uh hip.

A

That has a very close correspondence to Cuda, and so it turns out that you know it was a little bit of work, but you know we can basically uh have a hipified version of our code um uh based on our original Cuda kernels, with just a little bit of extra um uh pre-processing macro magic and um but otherwise we can leave the the code path generally unchanged um now for for the uh for Aurora.

A

On the other hand, um this this is uh going to be based on Intel gpus, and um so uh our approach here is to uh implement this in sickle, and we do have some automatic translation tools that can you know, take our Cuda kernels and turn them into sickle code. But it's not something that we can really support right now on um the the same GPU code paths. So so this is really effectively giving us a yet another GPU accelerated code path through named.

A

um You know, that's that's something that you know we're going to have to. You know deal with eventually, um but anyway, um that's it's been a longer process. It hasn't been as easy to Port an amd2 aurora.

A

At the moment we do have all of our GPU offload kernels ported, and we still need to Port the GPU resident parts of namdi, okay, so I'd like to end by acknowledging uh particularly the the various uh you know, people that we've either had here at the University or our you know, Partners at you know different companies that have had their hands on the GPU parts of named.

A

So thanks and I'm happy to take questions.