From YouTube: NESAP Project: WDMApp
Description
Aaron Scheinberg
NESAP Project: WDMApp
Yes, so this is a project which is an exascale project, as well as XGC standing alone, and we've been doing a lot of development and already some science work on Perlmutter with a great team from all over the country.
It's a gyrokinetic code, which means that we're taking the 6D plasma physics and reducing it to five dimensions using a technique called gyro-averaging. It's a particle-in-cell code on an unstructured 2D grid in the poloidal cross section, which you can see on the right here. So we have this unstructured grid, and that's what really enables us to do realistic geometry.
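As a quick aside on what that reduction means, the standard gyrokinetic bookkeeping (not anything specific to XGC's exact formulation) averages over the fast gyration angle, so the six particle coordinates collapse to five gyrocenter coordinates:

```latex
% 6D particle phase space -> 5D gyrocenter phase space (gyrophase averaged out)
f(\mathbf{x}, \mathbf{v}, t)
  \;\longrightarrow\;
\bar{f}(\mathbf{X}, v_{\parallel}, \mu, t),
\qquad
\mu = \frac{m v_{\perp}^{2}}{2B}
```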
Then, in the toroidal dimension, we use a structured grid; it's a series of planes. We use domain decomposition, where there are these toroidal slices, and each MPI rank is in charge of a subset of the grid.
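A minimal sketch of what that kind of plane-based decomposition could look like, assuming a simple rank-to-plane mapping (all names and the splitting logic here are hypothetical, not XGC's actual code):

```cpp
// Hypothetical sketch of plane-based domain decomposition (not XGC's actual code).
// Each MPI rank is assigned one toroidal plane and a contiguous range of
// unstructured-grid nodes within that plane.
#include <mpi.h>

struct Domain {
  int plane;        // which toroidal plane this rank works on
  int first_node;   // first grid node owned in the poloidal plane
  int last_node;    // one past the last grid node owned
};

Domain decompose(int rank, int nranks, int nplanes, int nodes_per_plane) {
  // Ranks are grouped by plane (assumes nranks is divisible by nplanes);
  // within a plane they split the unstructured grid into contiguous chunks.
  int ranks_per_plane = nranks / nplanes;
  Domain d;
  d.plane = rank / ranks_per_plane;
  int local = rank % ranks_per_plane;
  int chunk = (nodes_per_plane + ranks_per_plane - 1) / ranks_per_plane;
  d.first_node = local * chunk;
  d.last_node  = (d.first_node + chunk < nodes_per_plane)
                   ? d.first_node + chunk : nodes_per_plane;
  return d;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  Domain d = decompose(rank, nranks, /*nplanes=*/32, /*nodes_per_plane=*/100000);
  (void)d;  // each rank would now build its local field and particle data
  MPI_Finalize();
}
```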
A
So
these
alternating
fake
colors
show
the
different
MPI
ranks
and
the
basic
algorithm
is
that
you
have
particles
that
scatter
their
charge
onto
the
grid
and
then
using
that
information
you
can
solve
for
the
electric
field
using
the
electric
field,
you
can
push
the
electrons
and
the
ions
to
their
new
position
at
the
next
time.
Step
and
electrons
can
be
pushed
multiple
times
for
each
field
solve
and
because
we
do
this,
what
we're
calling
subcycling
the
electron
push
ends
up
being
our
most
expensive
kernel.
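Spelled out as code, one time step looks roughly like the following, a minimal sketch with stand-in types and stub functions (none of these are XGC's actual interfaces):

```cpp
// Minimal sketch of one particle-in-cell step with electron subcycling, as
// described above. All types and functions here are hypothetical stand-ins.
#include <vector>

struct Particles { std::vector<double> x, v; };   // positions and velocities
struct Field     { std::vector<double> E; };      // electric field on the grid
struct Grid      { std::vector<double> charge; }; // charge density on grid nodes

// Stub implementations so the sketch compiles; real versions do the physics.
void deposit_charge(Grid&, const Particles&) {}
Field solve_field(const Grid& g) { return Field{std::vector<double>(g.charge.size())}; }
void push(Particles&, const Field&, double /*dt*/) {}

void pic_step(Grid& grid, Particles& ions, Particles& electrons,
              int n_subcycles, double dt) {
  // 1) Particles scatter (deposit) their charge onto the unstructured grid.
  deposit_charge(grid, ions);
  deposit_charge(grid, electrons);

  // 2) Use the deposited charge to solve for the electric field.
  Field E = solve_field(grid);

  // 3) Push the ions once per field solve.
  push(ions, E, dt);

  // 4) Subcycle the electrons: several smaller pushes per field solve.
  //    This repetition is why the electron push is the most expensive kernel.
  const double dt_e = dt / n_subcycles;
  for (int s = 0; s < n_subcycles; ++s) {
    push(electrons, E, dt_e);
  }
}
```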
So the Whole Device Model app, or WDMApp, is an Exascale Computing Project, and the idea is that XGC is specialized to work on the edge of a plasma, so we can couple XGC with a core code like GENE or GEM, and that enables whole-device modeling. The vast majority of the time, over 90 percent, is spent in XGC, so its optimization is most critical, and that's what I'm going to be talking about.
Our target architectures are, of course, Perlmutter and Cori and Summit, and then Frontier and Aurora in the future. An interesting thing is that, of course, they all have their different native languages, with CUDA as the NVIDIA language, HIP for Frontier, and SYCL for Aurora.
So we opted to use Kokkos, and also to convert our Fortran code to C++ to enable this. Kokkos is a portability abstraction layer that maps to these different languages and allows us to not have to worry about that. Before 2019, XGC was a Fortran code with three versions of our dominant kernels: we had OpenACC and CUDA Fortran, as well as a vectorized CPU version, and a simpler reference version.
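To give a flavor of what that abstraction looks like, here is a minimal, self-contained Kokkos kernel (illustrative only, not XGC code); the same loop body builds against the CUDA, HIP, SYCL, OpenMP, or serial backend chosen at configure time:

```cpp
// Minimal example of a portable Kokkos kernel (illustrative only, not XGC code).
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Views are allocated in the default execution space's memory (e.g. GPU).
    Kokkos::View<double*> x("x", n), y("y", n);

    // One portable loop body; Kokkos maps it to the native programming model.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = 2.0 * x(i) + y(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```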
So our approaches have been portability with Kokkos, a major focus on encapsulation and modularity, and a lot of templating. For example, the electron push and the ion push are quite different, but we're able to use the same code for them, and this makes it a lot easier than before to experiment and to swap out different options.
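A rough sketch of that templating idea, with hypothetical species types standing in for the real electron and ion pushers (the physics in advance() is a placeholder, not XGC's equations of motion):

```cpp
// Sketch of templating one push routine over species (hypothetical types).
#include <Kokkos_Core.hpp>

// Each species type supplies its own details via advance().
struct IonPusher {
  KOKKOS_INLINE_FUNCTION
  void advance(double& x, double& v, double E, double dt) const {
    v += qm * E * dt;   // placeholder acceleration
    x += v * dt;
  }
  double qm = 1.0;      // charge-to-mass ratio (illustrative value)
};

struct ElectronPusher {
  KOKKOS_INLINE_FUNCTION
  void advance(double& x, double& v, double E, double dt) const {
    v += qm * E * dt;   // electrons: different q/m, subcycled with smaller dt
    x += v * dt;
  }
  double qm = -1836.0;  // illustrative value only
};

// One templated kernel serves both species; swapping implementations is a
// one-line change at the call site rather than a second copy of the loop.
template <class Pusher>
void push_all(const Pusher& pusher,
              Kokkos::View<double*> x, Kokkos::View<double*> v,
              Kokkos::View<double*> E, double dt) {
  Kokkos::parallel_for("push", x.extent(0), KOKKOS_LAMBDA(const int i) {
    pusher.advance(x(i), v(i), E(i), dt);
  });
}
```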
We also have each of our major code components able to be run independently, and they use the same code base, so they're not copies. That means that any time someone is working on these individual components, or if we give them out to vendors to look at on a specific architecture, they're never outdated, they don't require extra maintenance, and any work done on those kernels can immediately go back and benefit the code. We've also really revamped our testing and our CI, with unit tests, regression tests, run tests, and automated physics testing still in progress.
So, as I said, the original attempt was to keep the computation kernels in Fortran with wrappers, and this was a feasible strategy for CUDA Fortran, but it didn't make much sense for AMD or Intel GPUs. It also limited our functionality and our ability to fully utilize Kokkos.
On the other hand, we decided to opt for this mid-air replacement, because that way we have a single code base, and the maintenance and improvements that are made benefit the current production code as well. We're also able to keep the code up to date with new physics capabilities, and this has already happened: since we've been doing this conversion, we've successfully added electromagnetic physics and multi-species physics.
So one interesting question that has come up, going specifically from Summit to Perlmutter, is particle memory management. On Summit we currently have our particles resident on the host, and every time we want to do an operation on the device, we send all of the particle data to the GPU, run the kernel, and then bring the results back. This means that more particles are possible in general, because we only need to fit one species on the GPU at a time, but it also adds extra communication time. On Perlmutter, there's actually so much more space on the GPU that we can just leave all of our species on the GPU, and that means all of that communication time is eliminated. We've also been looking at asynchronous streaming to try to hide that communication, but on Perlmutter that's not necessary, though we do have to maintain it.
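The two strategies look roughly like this in Kokkos terms, a sketch with placeholder views and a placeholder kernel rather than XGC's actual memory manager:

```cpp
// Sketch of the two particle-memory strategies described above (placeholders only).
#include <Kokkos_Core.hpp>

using HostParticles   = Kokkos::View<double*, Kokkos::HostSpace>;
using DeviceParticles = Kokkos::View<double*>;  // default (device) memory space

// Summit-style staging: particles stay resident on the host; each operation
// copies one species to the GPU, runs the kernel, and copies the result back.
// Only one species has to fit in GPU memory at a time, at the cost of transfers.
void staged_push(HostParticles h_species, DeviceParticles d_buffer) {
  Kokkos::deep_copy(d_buffer, h_species);                  // host -> device
  Kokkos::parallel_for("push_staged", d_buffer.extent(0),
    KOKKOS_LAMBDA(const int i) { d_buffer(i) += 1.0; });   // placeholder work
  Kokkos::deep_copy(h_species, d_buffer);                  // device -> host
}

// Perlmutter-style residency: the GPUs have enough memory to keep every
// species on the device all the time, so the copies above disappear entirely.
void resident_push(DeviceParticles d_species) {
  Kokkos::parallel_for("push_resident", d_species.extent(0),
    KOKKOS_LAMBDA(const int i) { d_species(i) += 1.0; });
}
```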
So this is our Summit performance, going from January 2019 up to, well, time flies, so there's been progress over the last year, but this goes up to November 2021. This is an electrostatic simulation, and I basically just want to show that we have steady improvement, because we've taken more and more of these kernels, shown on the left to indicate that they're on the CPU, and offloaded them to the GPU, and as more of that has happened, we've gotten a big speedup.
We have also done weak scaling studies on the entire Summit machine, and we've made a lot of improvement there up to the present day. But one interesting thing is that our weak scaling actually gets worse because, as our computations are all offloaded, the communication becomes a relatively larger part of our total time. So our particle shift, where we're moving particles between nodes, as well as our electric field interplanar gather, suddenly become a lot more important.
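The reasoning behind that is simple to spell out: offloading shrinks the compute time per step while the communication time stays roughly fixed, so the communication fraction grows even though the communication itself got no slower:

```latex
f_{\text{comm}} \;=\; \frac{T_{\text{comm}}}{T_{\text{comp}} + T_{\text{comm}}}
\;\longrightarrow\; 1
\quad \text{as } T_{\text{comp}} \to 0 \text{ with } T_{\text{comm}} \text{ fixed}
```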
As I said, in the electromagnetic simulation the electron push kernel is less important because it's subcycled fewer times, and so this really changes which kernels most need to be optimized. We're still working on optimizing the electromagnetic version, as opposed to the electrostatic version.
We've also scaled on Perlmutter, up to a thousand Perlmutter nodes, and this is our weak scaling. It's not as good as on Summit, and I think there are a number of ways that we can improve it going forward, with GPU-aware MPI, and also by putting more particles onto each node, basically just packing a larger problem size than we had been doing.
So one option is to run the same simulation, but packed into as few nodes as possible on each machine. I took one simulation that I could run on 64 Cori KNL nodes, and the total memory of 64 Cori KNL nodes is six terabytes. I put this onto 16 Perlmutter nodes, which is a bit more memory available, but it's basically just keeping the nodes packed, using as large a problem size as I could. Just based on the theoretical flops of those two sets of nodes, you would expect that the 16 Perlmutter nodes would be 3.4 times faster than the 64 Cori KNL nodes, and in this particular simulation that I ran, it was actually almost nine times faster.
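Written out, the comparison being made is just the ratio of aggregate theoretical peak flops of the two node sets against the measured runtime ratio (only the symbolic form is mine; the 3.4x and roughly 9x figures are from the run described above):

```latex
S_{\text{expected}} \;=\; \frac{16\,F_{\text{Perlmutter node}}}{64\,F_{\text{Cori KNL node}}} \;\approx\; 3.4,
\qquad
S_{\text{observed}} \;\approx\; 9
```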
We've already been able to do some physics on Perlmutter. Just to prime this: in many tokamaks, the exhaust from the plasma is directed along this separatrix line on the outside, and then this plasma goes and hits the divertor, at the bottom of the tokamak in this case. So the divertor must be prepared to handle a very high amount of heat from this exhaust, and naturally a wider impact area is going to be better, to spread that heat across a wider surface area. This figure shows, as a function of poloidal magnetic field, what the divertor width is expected to be based on some empirical scaling laws.
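For reference, the commonly quoted empirical fit of this kind is the Eich multi-machine scaling, in which the heat-flux width narrows as the poloidal field increases; the coefficients below are the usual published values, not numbers from this talk:

```latex
\lambda_q\,[\mathrm{mm}] \;\approx\; 0.63 \times \left(B_{\text{pol}}\,[\mathrm{T}]\right)^{-1.19}
```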
So this is a visualization of a simulation that we ran on Perlmutter, and it's demonstrating the presence of these structures here, which are called homoclinic tangles. This is not something that I have presented before, so I don't know exactly what a homoclinic tangle is, despite trying to read up on it, but these structures have been observed in the simulation, and the end result is that you have a much more diffuse stream of plasma from the X-point to the divertor than previously thought.
So to conclude, XGC is running on Perlmutter, and it's generally performing pretty well. There's still plenty of work to be done, especially in the electromagnetic mode: we need to offload some more kernels, we need to improve some MPI communication and load balancing, and we need to continue to keep up with new physics additions. But Perlmutter is already enabling XGC simulations that are providing insight on electromagnetic fusion plasma behavior and making predictions for ITER.