From YouTube: Massively Parallel PIC using WARPX
Description
Andrew Myers (CRD, LBL)
Massively Parallel PIC using WARPX
Okay, all right, thank you very much for giving me the opportunity to talk. I assume everyone can see the slides. Yes.
So yeah, I'm going to talk about WarpX today. Before we start, I want to note that I'm going to show some scaling results from Perlmutter and also from Frontier. These are pre-acceptance, so I expect that if we ran them again in a month they might look different. I also want to share a bit about the team.
So there are a lot of people who work on WarpX. This is kind of a snapshot of what the current team looks like. It's a multi-disciplinary team; there are people with physics, applied math, and computer science backgrounds. Most of us work here at Berkeley Lab, but there are also people at Livermore and SLAC. There are a number of collaborators from European facilities like CEA Saclay, DESY, and CERN, and there's also a growing number of collaborators in industry who are using WarpX and contributing a lot of great things back, so we appreciate that. I also want to mention that WarpX is one of the NESAP codes.
So, just some background and motivation. The main application of WarpX is modeling particle accelerators, and I think when most people think of particle accelerators, they think about something like the LHC at CERN. This is a giant, building-sized piece of equipment, I think it's 27 kilometers in circumference or something like that, and it accelerates particles to fantastic energies. It's used for discovery science, and particle accelerators have been very successful in that role, enabling a lot of the Nobel prizes that have been awarded in both physics and chemistry over the years.
But there are other applications of accelerators as well. Medical applications, for example: there are 9,000 medical accelerators in operation worldwide, used for things like radiation treatment for cancer or the production of medical isotopes. There are also about 20,000 industrial accelerators used in various capacities, such as semiconductor manufacturing and sterilization of food, and there are a number of national security applications as well. The annual value of all products that use accelerator technology is estimated to be 500 billion dollars.
So the point we're trying to make is that there's an opportunity for particle accelerators to have an even bigger impact if we can reduce their size and cost, and modeling plays a role here, because it allows us to explore and understand the underlying physics and also aid in tuning specific prototype accelerator designs.
A
So
the
next
generation
of
accelerators
needs
the
next
generation
of
HPC
modeling
tools.
So
a
potential
Avenue
for
improving
on
the
size
and
cost
of
accelerators
relies
on
this
plasma.
Acceleration
idea.
So
there's
a
couple
ways
of
doing
that,
but
you
can
fire
either
a
laser
beam
or
a
particle
beam
through
a
plasma.
It
transfers
energy
and
creates
these
electric
fields
in
the
plasma
that
have
this
weight
field
configuration
and
then
there's
a
beam
of
particles
traveling
in
that
wake
that
can
get
accelerated
to
high
energy
and
as
of
2019.
A
A
So
that's
nowhere
near
the
energy
that
led
like
the
LHC,
but
if
you
want
to
build
something
that
could
accelerate
things
to
like
the
multi
TV
scale,
the
idea
would
be
to
chain
a
bunch
of
these
individual
stages
together
and
you
would
need
to
like
the
laser
case.
Depleted
of
energy
you'd
need
to
like
inject
new
lasers
after
every
stage
and
there's
a
bunch
of
tuning.
That
needs
to
be
done
to
make
sure
the
beam
quality
is
maintained
as
it
passes
from
one
stage
to
another.
So this is kind of WarpX's challenge problem, and so far we've done, I think, the first simulation of this kind, modeling 10 stages of a laser wakefield accelerator. You can see it here; this is in-situ visualization. I believe this is showing iso-contours of the transverse electric field, and it's also showing the particle beam colored by the longitudinal momentum.
This was an in-situ rendering done using Ascent plus VTK-m, and we were able to do a convergence study of this using 3 to 768 GPUs per run, and the convergence properties looked nice. So that's kind of the WarpX challenge problem, but on top of that there are a number of other application areas, and it's a growing list, I guess.
People are using WarpX to study laser-ion acceleration, to look at plasma confinement for fusion devices, and to model microelectronic devices and thermionic converters. There's also an effort to apply WarpX to astrophysical modeling. So, although it was designed with particle accelerators in mind, it's a general PIC code and can be used for other things as well. So, just a sort of overview of the WarpX code then: it's a PIC code.
We have macroparticles that represent collections of electrons or positrons or other charged particle species, and there's also a mesh on which we store the electromagnetic fields, the current density, and the charge density.
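To make the basic PIC cycle concrete, here is a minimal, self-contained 1D sketch of the gather / push / deposit steps just described. This is purely illustrative (plain C++ on a periodic grid with linear weights), not WarpX code, and every value in it is made up.

```cpp
// Minimal sketch of one particle-in-cell (PIC) step on a 1D periodic grid.
// Illustrative only; WarpX itself is 3D, electromagnetic, and GPU-capable.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int nx = 64;                          // number of grid cells (hypothetical)
    const double dx = 1.0, dt = 0.1, q = -1.0, m = 1.0;
    std::vector<double> Ex(nx, 0.0), rho(nx, 0.0);
    std::vector<double> xp = {3.2, 10.7, 40.1}; // particle positions
    std::vector<double> vp = {0.0, 0.5, -0.2};  // particle velocities

    for (std::size_t p = 0; p < xp.size(); ++p) {
        // 1) Field gather: interpolate the grid field to the particle (linear weights).
        int i = static_cast<int>(std::floor(xp[p] / dx));
        double w = xp[p] / dx - i;
        double Ep = (1.0 - w) * Ex[i % nx] + w * Ex[(i + 1) % nx];
        // 2) Particle push: update velocity, then position.
        vp[p] += q / m * Ep * dt;
        xp[p] += vp[p] * dt;
        xp[p] = std::fmod(xp[p] + nx * dx, nx * dx);   // periodic wrap
        // 3) Charge deposition: scatter the particle's charge back to the grid.
        int j = static_cast<int>(std::floor(xp[p] / dx));
        double wj = xp[p] / dx - j;
        rho[j % nx]       += q * (1.0 - wj) / dx;
        rho[(j + 1) % nx] += q * wj / dx;
    }
    // A field solve (e.g. a finite-difference Maxwell update) would follow here.
    std::printf("rho[3] = %f\n", rho[3]);
    return 0;
}
```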
On top of the basic PIC algorithms, WarpX implements a number of advanced features: the ability to operate in a Lorentz-boosted reference frame, high-order spectral solvers, support for embedded geometries, support for mesh refinement, and so on. There are also a number of multi-physics modules that come in via the PICSAR library; these model things like field ionization, Coulomb collisions, and QED processes such as pair creation, for example. We support 1D, 2D, and 3D Cartesian geometry, and we also have support for an RZ quasi-cylindrical mode.
In terms of the parallelization, we use a hierarchical approach. There's an MPI level where we have different boxes; this is a 3D domain decomposition. Those boxes are distributed over the different MPI ranks, and we can also do dynamic load balancing by shuffling those boxes around as we want.
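AMReX has its own box-distribution strategies (for example space-filling-curve and knapsack approaches), so the following is only a toy sketch of the general idea behind load balancing by shuffling boxes: estimate a cost per box and greedily hand the most expensive boxes to the least-loaded ranks. All names and costs here are hypothetical.

```cpp
// Greedy "heaviest box to least-loaded rank" sketch; not AMReX's implementation.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> box_cost = {5.0, 3.0, 7.0, 2.0, 4.0, 6.0}; // per-box work estimates
    const int nranks = 3;
    std::vector<double> rank_load(nranks, 0.0);
    std::vector<int> owner(box_cost.size(), -1);

    // Visit boxes from most to least expensive.
    std::vector<int> order(box_cost.size());
    for (int b = 0; b < static_cast<int>(order.size()); ++b) order[b] = b;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return box_cost[a] > box_cost[b]; });

    for (int b : order) {
        int r = static_cast<int>(
            std::min_element(rank_load.begin(), rank_load.end()) - rank_load.begin());
        owner[b] = r;                  // assign box b to the least-loaded rank
        rank_load[r] += box_cost[b];
    }
    for (int b = 0; b < static_cast<int>(owner.size()); ++b)
        std::printf("box %d -> rank %d\n", b, owner[b]);
    return 0;
}
```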
Then, on a given node, what happens depends on whether we're compiling for GPU or CPU execution. For GPUs we have support for CUDA, HIP, and SYCL backends, and we also have support for an OpenMP backend if we're doing multi-threaded calculations on, for example, many-core architectures. Finally, there's support for a couple of different kinds of scalable parallel I/O formats and also support for in-situ diagnostics.
So, to port WarpX to GPUs and to achieve performance portability, we're using the AMReX library. This was developed as part of the ECP, the Exascale Computing Project. In addition to the performance portability, it also handles things like domain decomposition and MPI communication, so when you do ghost cell exchanges or particle redistribution, that's handled via AMReX. It also provides tools for the mesh refinement aspects of WarpX and tools for doing the dynamic load balancing.
This is the way the data structures work on the GPU versus the CPU. On the GPU, on each box, we essentially launch CUDA or HIP or DPC++ kernels, and the threads are mapped to either the different cells in the box or the different particles in the box and process them concurrently. With OpenMP, we have an additional layer of parallelism that we support.
The bulk of the support for GPUs is done through these ParallelFor routines. These are part of AMReX; it's similar to what is provided in Kokkos or RAJA, in that the work you're expressing is done via this lambda function right here, and the idea is that, depending on how you compiled the code, it will specialize it for a specific platform.
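As a rough sketch of that pattern (assuming AMReX is installed; header names are as I recall them and may differ slightly between versions), a ParallelFor launch over a box looks roughly like this. The lambda body is what gets specialized for CUDA, HIP, SYCL, or the host depending on how AMReX was configured.

```cpp
// Hedged sketch of an AMReX ParallelFor launch over a box of cells,
// assuming a 3D AMReX build; the kernel body itself is a toy example.
#include <AMReX.H>
#include <AMReX_FArrayBox.H>
#include <AMReX_Gpu.H>

int main(int argc, char* argv[]) {
    amrex::Initialize(argc, argv);
    {
        // A 32^3 box of cells and a single-component field living on it.
        amrex::Box bx(amrex::IntVect{0, 0, 0}, amrex::IntVect{31, 31, 31});
        amrex::FArrayBox fab(bx, 1);
        amrex::Array4<amrex::Real> const& a = fab.array();

        // One GPU thread (or one loop iteration on the host) per (i,j,k) cell.
        amrex::ParallelFor(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
            {
                a(i, j, k) = static_cast<amrex::Real>(i + j + k);
            });
    }
    amrex::Finalize();
    return 0;
}
```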
A
So
what
you're
seeing
here
is
both
the
scaling
of
the
code,
which
looks
quite
nice
up
to
2048
nodes
and
also
the
benefit
that
we
get
from
running
on
the
gpus,
which
I
think
this
is
like
a
factor
of
30
or
something
Improvement
on
this
problem.
I
should
say
so.
This
is
these.
These results are for V100, but if we run the same thing on A100 we get an additional improvement, almost a factor of two comparing V100 to A100, and that was nice because it was a fair bit of work to port WarpX to run on GPUs. But once we did that and had it running well on Summit, it just ran without any code modifications on Perlmutter, and it was almost twice as fast. So that's nice.
So, a bit about the porting-to-GPU process. In order to use these ParallelFor routines, we had to port the kernels in WarpX from Fortran to C++. On the left is one of the finite-difference solvers that's available in WarpX, the routine that updates the electric field in the y direction.
This is what it looked like in Fortran, and this is what it looked like in C++. The original WarpX code was a mix of C++ and Fortran, and the Fortran was mostly for the computationally expensive kernels that crunch the numbers. I guess there are two points I want to make. One is that, other than the loop over the cells, the actual update that does the math here is almost exactly the same between the Fortran and the C++, and this is facilitated by the AMReX Array4 multidimensional array type, which is designed to be used as much like Fortran as possible. We did comparisons of the overhead between this C++ multidimensional array and Fortran, and it's basically nothing for CPU execution.
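For illustration, here is a sketch of what such an Array4-based field update can look like. This is not the actual WarpX kernel, and all names are assumptions; the stencil follows the Ampere-Maxwell update for Ey, dEy/dt = c^2 (dBx/dz - dBz/dx) - Jy/eps0.

```cpp
// Illustrative sketch (not the WarpX source) of a Yee-style Ey update written
// against AMReX's Array4 view and ParallelFor. All names are assumptions.
#include <AMReX_FArrayBox.H>
#include <AMReX_Gpu.H>

void update_Ey (amrex::Box const& bx,
                amrex::Array4<amrex::Real> const& Ey,
                amrex::Array4<amrex::Real const> const& Bx,
                amrex::Array4<amrex::Real const> const& Bz,
                amrex::Array4<amrex::Real const> const& jy,
                amrex::Real dt, amrex::Real dx, amrex::Real dz,
                amrex::Real c2, amrex::Real inv_eps0)
{
    // One GPU thread (or one loop iteration on the host) per (i,j,k) cell.
    amrex::ParallelFor(bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            // dEy/dt = c^2 * (dBx/dz - dBz/dx) - Jy / eps0
            Ey(i,j,k) += dt * ( c2 * ( (Bx(i,j,k) - Bx(i,j,k-1)) / dz
                                     - (Bz(i,j,k) - Bz(i-1,j,k)) / dx )
                                - inv_eps0 * jy(i,j,k) );
        });
}
```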
So there was some upfront cost in development time for doing the port, but now we have a single code base that works for NVIDIA, AMD, and Intel GPUs and still supports many-core CPUs. And again, once this process was done for V100, getting the code to run on A100 and on the MI200-series GPUs that are on Frontier was relatively pain-free.
I want to talk a bit about this: WarpX is one of the finalists for the Gordon Bell prize this year, and they let us on Frontier in order to do some bigger runs and see how the code scaled there. We also had access to Perlmutter through the NESAP project, and we were also able to do some big runs on both Fugaku and Summit. The simulation was basically one of those plasma acceleration stages, so it's a setup kind of like this.
You have the laser pulse, you have the wakefield, and there's a beam of particles behind it. What we got was quite nice weak scaling, basically up to the full number of nodes available on these machines. On Frontier and Fugaku we're getting about 85 to 90 percent weak scaling efficiency, and on Perlmutter and Summit it's maybe more like 75 percent, but that's at almost the full scale of the machine.
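For reference, the weak-scaling efficiency quoted here is the standard ratio of single-node time to N-node time with the work per node held fixed:

```latex
% Weak-scaling efficiency with the problem size grown in proportion to N nodes:
% t_1 is the runtime on one node, t_N the runtime on N nodes at fixed work per node.
\[
  E_{\mathrm{weak}}(N) \;=\; \frac{t_1}{t_N},
\]
% so roughly 0.85--0.90 on Frontier/Fugaku and about 0.75 on Perlmutter/Summit
% is the fraction of per-node throughput retained at near-full machine scale.
```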
So, in order to compare performance across different machines, we use this figure of merit, which is basically a measure of the number of particles that you can update in a given unit of time. The figure of merit goes up if you either run a bigger problem in the same amount of time or run the same problem faster; it has both of those things rolled into it.
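The talk doesn't spell out the exact definition, but a figure of merit consistent with "particles updated per unit of time" would be:

```latex
% A figure of merit consistent with "particle updates per unit time":
% N_p = number of macroparticles, N_steps = time steps taken, T = wall-clock time.
\[
  \mathrm{FOM} \;=\; \frac{N_p \times N_{\mathrm{steps}}}{T},
\]
% which increases either by running a larger problem in the same time
% or by running the same problem faster.
```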
What this chart is showing is our progress in this measure over time. You can see that in transitioning from the original Warp code to WarpX, which is the AMReX port, there was a nice improvement, and then in transitioning WarpX from Cori to take advantage of the GPUs on Summit there was again a nice improvement. Over the years we've made some optimizations; I think there are also some times where we go backwards in here too.
But overall, the improvement over the pre-ECP baseline is about a factor of 500, comparing what we got on Frontier to what we got with the original Warp code on Cori, and it's also about a factor of 100 just comparing where we were in 2019, with WarpX on Cori, to what we see on Frontier now. So yeah, I like this chart because it shows machines with AMD GPUs, machines with NVIDIA GPUs, and many-core machines as well.
Another thing I wanted to mention: I sort of implied that you could just use this ParallelFor construct and port all your kernels that way, and for the majority of the kernels in WarpX that is all we did. For some particularly performance-critical kernels, though, it's worth it to do some extra tuning, and an example of that is in the core particle-mesh routines in WarpX, the ones that do the current deposition and the field gathering.
What we found was, first of all, that both of these kernels are heavily memory bound, but if you don't do any particle sorting or anything like that, you're bound by the bandwidth between HBM and the processors. If you do occasionally sort the particles into cell order, so that they're ordered in memory the same way as the mesh, the accesses become much more local and those kernels run significantly faster.
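As a rough illustration of the idea (WarpX and AMReX have their own particle binning and sorting utilities; this standalone sketch just reorders a particle array by a linearized cell index):

```cpp
// Sketch of sorting particles into cell order so that particles in the same
// cell are contiguous in memory; illustrative only, not WarpX's implementation.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Particle { double x, y, z; };

int main() {
    const double dx = 1.0;          // cell size (hypothetical)
    const int nx = 8, ny = 8;       // cells per direction in x and y
    std::vector<Particle> particles = {
        {5.2, 1.1, 0.4}, {0.3, 0.9, 7.7}, {5.4, 1.2, 0.6}, {0.1, 0.8, 7.5}};

    // Key each particle by a linearized cell index, then sort on that key.
    auto cell_index = [&](const Particle& p) {
        int i = static_cast<int>(std::floor(p.x / dx));
        int j = static_cast<int>(std::floor(p.y / dx));
        int k = static_cast<int>(std::floor(p.z / dx));
        return (k * ny + j) * nx + i;
    };
    std::sort(particles.begin(), particles.end(),
              [&](const Particle& a, const Particle& b) {
                  return cell_index(a) < cell_index(b);
              });

    for (const auto& p : particles)
        std::printf("cell %d: (%.1f, %.1f, %.1f)\n", cell_index(p), p.x, p.y, p.z);
    return 0;
}
```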
And I believe I'm getting short on time, so I'll skip the next couple of slides. They just say that all our development takes place on GitHub, we follow an open-source development model, and our documentation is online as well. With that, thank you, and I'd be happy to answer any questions.
Thank you very much. I think there are a couple of questions in the chat; you could just look at that one.
Okay: what is the most costly step of a basic WarpX simulation? It's usually the current deposition, depending on how many particles you have. If you have a very low number of particles per cell, you can get into a regime where the communication costs are the highest thing, but assuming you have a good number of particles, it's the current deposition. And then, as for what we use to solve for the EM fields:
There are a couple of different finite-difference solvers; there's the Yee solver and the Cole-Karkkainen solver implemented in WarpX, and there's also a spectral method, the pseudo-spectral analytic time-domain method, that's an option as well. So there are a few different choices you have for solving for the fields.