Description
Victor Yu & Marco Govoni
GPU Acceleration of the WEST Code for Large-Scale Full-Frequency GW Calculations
Hello, everybody. First of all, thank you very much for having me here. I will be talking about the GPU porting of the WEST code, which is a code for large-scale materials simulations based on many-body perturbation theory.
To put our code development into context: in our group we use first-principles simulations to study electronic structure, for both ground-state and excited-state properties, and we are particularly interested not only in small molecules but also in very large and heterogeneous systems. This is motivated by our target applications, for example nanoparticles for energy harvesting, solid/liquid interfaces for water splitting, spin defects in semiconductors for quantum technologies, and so on.
Shown on this page are just some examples of the nanostructures and materials that we would like to simulate at the many-body perturbation theory level. To do that, we have developed a code package called WEST, which stands for Without Empty STates.
WEST is a parallel implementation of many-body perturbation theory, including GW and BSE, but what distinguishes WEST from a conventional GW code is that it uses a formalism that does not require any summation over empty orbitals. This summation is quite expensive, and it is avoided in WEST. WEST also does not require the storage or inversion of very large matrices, which are widely used in conventional GW codes.
The CPU version of WEST scales very well on CPU-based supercomputers; for example, the largest GW calculation performed with WEST consisted of over 2,000 electrons.
The functionalities of the code are summarized in this plot. It is capable of computing GW and electron-phonon self-energies, and in addition it can simulate excitation processes using density-matrix perturbation theory, which includes BSE and time-dependent density functional theory. It also has a quantum defect embedding theory, which targets strongly correlated defect states in semiconductors.
Finally, the code is parallel, using MPI plus OpenMP, and the CPU version scales to over 500,000 CPU cores. Last year we ported the WEST code to NVIDIA GPUs, and it has been tested and benchmarked on a number of GPU-enabled supercomputers, including Perlmutter at NERSC and Summit at Oak Ridge.
Our porting strategies are the following. Wherever applicable, we try to use vendor-optimized GPU libraries for the linear algebra operations, such as BLAS and FFT. For loops that cannot be handled by existing libraries, we use a directive-based approach, hoping that it is more portable than lower-level languages. We started from CUDA Fortran, which is very easy to write but not portable; then, in the latest release of the code, we transitioned from CUDA Fortran to OpenACC; and as a next step we are switching to OpenMP target offload, which should work on NVIDIA, AMD, and Intel GPUs.
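To illustrate the directive-based style, here is a minimal sketch of the same loop offloaded once with OpenACC and once with OpenMP target offload. This is not WEST source code (WEST is written in Fortran); the C function and its arguments are only placeholders for the kind of loop that cannot be delegated to an existing library.

```c
#include <stddef.h>

/* A simple update loop, y = y + a*x, offloaded with OpenACC directives. */
void axpy_openacc(size_t n, double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* The same loop written with OpenMP target offload, which is expected
 * to work on NVIDIA, AMD, and Intel GPUs. */
void axpy_omp_target(size_t n, double a, const double *x, double *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Moving between the two mostly means rewriting the directives while the loop body stays the same, which is what makes this approach more portable than rewriting kernels in a lower-level language.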
In terms of achievements, we were able to get a big speedup by using GPUs over the CPU-only version, which I will show you in a minute, and the code scales very well on GPU supercomputers, including Summit, Perlmutter, and ThetaGPU. One of the largest GW calculations we have done with the GPU version consists of over 10,000 electrons.
The code here has multiple loops. The outermost loop is sequential and cannot be done in parallel, but the inner loops can be parallelized. In the CPU version of WEST we had two levels of parallelization. At the first level we distribute the perturbations, which are more or less independent from each other.
We distribute them across images, which are nothing but subgroups of the MPI ranks, and then in the inner loop we distribute the plane-wave components across the MPI ranks within an image. As illustrated here, each image performs parallel Fourier transforms and other linear algebra operations in parallel. This works very well on CPUs, so naively we thought that, for each MPI rank, we could simply offload its work to a GPU, which would lead to a picture like this.
Now we are distributing the plane-wave coefficients across multiple GPUs, and therefore we are performing FFTs across many GPUs. This does not work well, because a parallel FFT requires all-to-all communications, which means that the GPUs have to talk to each other quite frequently, and, as we know, computation on a GPU is much faster than communication between GPUs.
We therefore want to avoid data communication between GPUs. To make that happen, we exposed two more levels of parallelism in our code: the first one is the loop over spin channels and the second one is the loop over wave functions. Why does this help? Because it further partitions the MPI ranks within an image into even smaller subgroups of ranks, so that each working group becomes smaller.
Here is just an example: instead of eight GPUs working together on a parallel FFT, now only two GPUs collaborate on parallel FFTs and linear algebra. By doing this we reduce the GPU-to-GPU communication, which is quite costly, and we can also better load-balance the workload on each GPU. Ideally, we do not want to split any FFT operation over two GPUs; if memory is not a limitation, we want to do each FFT on a single GPU.
That is how we get the best performance. But when the problem gets bigger, the FFTs become limited by device memory; then we have to split the FFTs across GPUs, but we use the smallest number of GPUs for each FFT and distribute the workload using the other levels of parallelism in the algorithm.
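A minimal sketch of how such a subgroup split can be expressed with MPI (the communicator and variable names here are hypothetical, not taken from WEST):

```c
#include <mpi.h>

/* Split the MPI communicator of one image into smaller subgroups, so that
 * each parallel FFT only involves as few ranks (GPUs) as the device memory
 * requires; different subgroups then work on different spin channels or
 * wave functions independently. */
MPI_Comm make_fft_subgroups(MPI_Comm image_comm, int ranks_per_fft)
{
    int rank;
    MPI_Comm_rank(image_comm, &rank);

    int color = rank / ranks_per_fft;   /* which subgroup this rank joins */
    int key   = rank % ranks_per_fft;   /* rank order inside the subgroup */

    MPI_Comm fft_comm;
    MPI_Comm_split(image_comm, color, key, &fft_comm);
    return fft_comm;
}
```

In the example above, ranks_per_fft = 2 corresponds to two GPUs collaborating on each FFT instead of eight, and ranks_per_fft = 1 is the ideal case where every FFT stays on a single GPU.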
On this page I am showing a comparison of our baseline GPU code, which is the black bar on this plot, to the version with all the levels of parallelization enabled, which is the red bar. This is a GW calculation of a silicon/water interface with roughly 1,500 electrons. We can see that we get about a 50% performance improvement by using more levels of parallelization and restricting each parallel FFT to one GPU.
Another approach to speeding up the code is to use non-blocking MPI functions and asynchronous GPU kernels to overlap GPU communication and computation. The example shown here is a place in the code where we compute the matrix multiplication of two distributed matrices; the colors here just indicate that the matrices are distributed across different MPI ranks. Our first version is very straightforward.
What we do is copy the local data from CPU to GPU, multiply the local block on the GPU, and then use MPI communication to obtain the next block to be multiplied. The timeline of this approach is shown here: red is the CPU-to-GPU copy, blue is the GPU computation, and orange is the time spent in MPI.
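A minimal sketch of this straightforward version, under the assumption that the distributed blocks are passed around in a ring (the helper gemm_on_gpu and all names are hypothetical, not WEST source):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Assumed wrapper around a GPU GEMM (e.g. a cuBLAS call); hypothetical. */
void gemm_on_gpu(const double *d_a, const double *d_b, double *d_c, int n);

void ring_gemm_blocking(double *block, double *d_local, double *d_recv,
                        double *d_result, int n, int nblocks, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int prev = (rank - 1 + size) % size;
    int next = (rank + 1) % size;

    for (int step = 0; step < nblocks; ++step) {
        /* 1) copy the block we currently hold from CPU to GPU */
        cudaMemcpy(d_recv, block, (size_t)n * n * sizeof(double),
                   cudaMemcpyHostToDevice);

        /* 2) multiply the local block by it on the GPU (default stream,
         *    so the copy and the GEMM execute in order) */
        gemm_on_gpu(d_local, d_recv, d_result, n);

        /* 3) blocking MPI call to obtain the next block; the GPU sits
         *    idle until this exchange completes */
        MPI_Sendrecv_replace(block, n * n, MPI_DOUBLE, next, 0, prev, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}
```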
We can then optimize this by using non-blocking MPI communication: while the GPU is doing the computation in the background, the CPU is carrying out MPI communication to prepare the next block of the matrix. By doing this we can hide the cost of the GPU computation behind the MPI communication. In fact, we found that the MPI communication part, which in this case is more expensive than the computation itself, can be further sped up by communicating the data in single precision.
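A minimal sketch of the overlapped version, with the same hypothetical names as above: the host posts non-blocking sends and receives for the next block and only waits for them after launching the GPU work for the current one.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Assumed wrapper around an asynchronous GPU GEMM launch; hypothetical. */
void gemm_on_gpu(const double *d_a, const double *d_b, double *d_c, int n);

void ring_gemm_overlap(double *cur, double *nxt, double *d_local,
                       double *d_recv, double *d_result, int n,
                       int nblocks, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int prev = (rank - 1 + size) % size;
    int next = (rank + 1) % size;

    for (int step = 0; step < nblocks; ++step) {
        /* copy the block we already have to the GPU */
        cudaMemcpy(d_recv, cur, (size_t)n * n * sizeof(double),
                   cudaMemcpyHostToDevice);

        /* start fetching the next block in the background ... */
        MPI_Request req[2];
        MPI_Isend(cur, n * n, MPI_DOUBLE, next, 0, comm, &req[0]);
        MPI_Irecv(nxt, n * n, MPI_DOUBLE, prev, 0, comm, &req[1]);

        /* ... while the GPU multiplies the current block; the kernel
         * launch returns immediately, so MPI can progress concurrently */
        gemm_on_gpu(d_local, d_recv, d_result, n);

        /* wait for the next block, then swap the two host buffers */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        double *tmp = cur; cur = nxt; nxt = tmp;
    }
}
```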
What we do is, before the MPI communication, convert the data from double precision to single precision, and then do the MPI communication in single precision. It turns out that doing this does not change the physical observables we are computing, but it leads to roughly a factor-of-two speedup in the MPI communication.
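A minimal sketch of the precision demotion around the exchange (names hypothetical; the point is only that the bytes sent over MPI are halved while the rest of the calculation stays in double precision):

```c
#include <mpi.h>
#include <stdlib.h>

/* Exchange a buffer with a partner rank in single precision, then restore
 * it to double precision for the remaining double-precision computation. */
void exchange_in_single_precision(double *buf, int count, int dest, int src,
                                  MPI_Comm comm)
{
    float *sp = (float *)malloc((size_t)count * sizeof(float));

    for (int i = 0; i < count; ++i)      /* demote before communicating */
        sp[i] = (float)buf[i];

    MPI_Sendrecv_replace(sp, count, MPI_FLOAT, dest, 0, src, 0,
                         comm, MPI_STATUS_IGNORE);

    for (int i = 0; i < count; ++i)      /* promote back afterwards */
        buf[i] = (double)sp[i];

    free(sp);
}
```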
Overall we got a 1.9x speedup compared to the baseline, and again the GW energies match the double-precision numbers very well. Finally, we also did some relatively low-level optimizations to fine-tune the GPU memory access and the I/O operations, and we ended up with a fourth version of the GPU code, which is 2.2x faster than our baseline.
Next, let's look at how the code scales on supercomputers. This page shows a benchmark of a cadmium selenide nanoparticle, which has about 900 electrons. The plot compares the time to compute the GW quasiparticle energies of this nanoparticle on different supercomputers. On the right-hand side are the older NERSC CPU machines, including the retired Edison and Cori Haswell. The code scales quite nicely there, but on the left side you can see that using GPUs it is much faster.
If we compare the same number of nodes of Summit, which is the orange symbols, and Cori Haswell, which is the purple triangles, we get about a 30x speedup. And, as mentioned in the previous talks, switching from Summit to Perlmutter, or in other words from V100 GPUs to A100 GPUs, with exactly the same code we got another factor of two, which we think is because the A100 has more device memory.
The scaling on Summit and Perlmutter is also quite nice; both are close to the ideal strong scaling indicated by the dashed line.
Now let's look at a bigger benchmark on Summit. This one uses two silicon supercell models, one with a thousand atoms and the other with 1,700 atoms.
Again, the strong scaling on Summit looks quite nice. The bigger system with 1,700 atoms scales better because it has more computation to do, and it is again quite close to ideal scaling. With 94% of the entire Summit machine, which means about 25,000 NVIDIA V100 GPUs, we were able to solve 80 quasiparticle energy levels of this big silicon supercell in about half an hour.
Lastly, I'd like to show you some production calculations done with the GPU version of WEST. Shown on this page are three very large GW calculations: a large nanoparticle with more than 2,000 electrons, a giant silicon/silicon nitride interface with over 10,000 electrons, and a spin defect in 4H silicon carbide with 6,000 electrons.
Those calculations were done using production settings, not some loose, inaccurate settings, and we are not only computing a few energy levels: we computed thousands of quasiparticle energies to plot the local density of states, as shown here. This shows that the GPU version of WEST can be used to solve very large problems that can hardly be done on CPUs alone.
So, to wrap up: we have ported the WEST code, specifically the GW part of WEST, to NVIDIA GPUs, and we achieved very good performance and scalability on GPU-accelerated supercomputers, including Perlmutter.
We carried out some large GW calculations, of which the largest one has 10,000 electrons, running on 25,000 NVIDIA GPUs. We recently ported the quantum embedding part of WEST to GPUs; we are still testing its performance on Perlmutter, but the initial results look quite promising. We also plan to port the other functionalities, like the BSE code and the electron-phonon code, to GPUs.
And finally, we want to make sure the code works not only on NVIDIA GPUs but also on AMD and Intel ones.
To achieve this, we are moving from OpenACC to OpenMP target offload, because we want to make the code work on the exascale machines like Aurora at Argonne and Frontier at Oak Ridge.
Here are the people and organizations who helped us on this project. In particular, we are part of the NERSC NESAP program, and thanks to this project we got help from NERSC experts and also got early access to Cori GPU and Perlmutter, which were invaluable resources for us.