From YouTube: GW Calculations at Scale
Description
Charlene Yang of NERSC presents a talk on GW Calculations at Scale. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Michael Rowan.
I'll be talking about some of the work we did for BerkeleyGW, which is one of the NESAP applications. I have been pretty lucky to be part of this team, and all the work we have done so far has won us a finalist nomination for the Gordon Bell Prize this year. So hopefully this is going to be interesting work to you.
So what is GW? G stands for the Green's function and W for the screened Coulomb interaction, and GW calculations usually sit on top of some of the other chemistry or materials science codes like Quantum ESPRESSO or ABINIT.
These applications calculate some of the ground-state properties, and then the results get fed into GW codes like BerkeleyGW and other GW codes, which refine on top of those results and give a more accurate estimate of, say, the self-energy or some other properties. These GW calculations help us understand questions like the ones listed here: what happens when you add or remove an electron from a system? How do electrons behave when you apply a voltage? How does the system respond to light or X-rays?
So these are very important questions in materials science, and very important for energy-related device design.

A bit about BerkeleyGW: there are four different modules in BerkeleyGW, namely Epsilon, Sigma, Kernel, and Absorption. If you compile BerkeleyGW you'll get four different executables, and these executables are run in sequence.
The output of Epsilon can be fed into Sigma to calculate the self-energy, and then the output of Sigma can be fed into Kernel to calculate something else, and it just goes on like this. But today I'm going to focus on just the GW-based calculations, which are Epsilon and Sigma; the other two are more Bethe-Salpeter-equation-based calculations.
Some of the computational motifs for BerkeleyGW are matrix multiplications, fast Fourier transforms, large reductions, eigenvalue problems, and matrix inversions, but in Epsilon and Sigma we're mostly dealing with the first three. And when I say large scale, some of the matrices have hundreds of thousands of rows and millions of columns, so we're really dealing with large matrices and large calculations.
Epsilon has a quartic scaling in terms of computation and cubic for memory; it's a bit lower for Sigma. I bring this up because it helps us understand what the bottlenecks could be for these kernels, or for these modules.
If I speak from a Roofline performance model point of view, these scaling properties help me understand whether a kernel is compute bound, memory-bandwidth bound, or communication bound, or even how much physical memory we need, because we could be limited by how much physical memory we have on the GPU or even on the host. Understanding this is very important; at least it has been very helpful for this work.
So the goal of this NESAP project is to have a GPU port of the code, and an efficient one. We started with a pretty efficient CPU implementation, which is parallelized with MPI and OpenMP and has scaled pretty well, up to 12 petaflops on Cori.
Given our understanding of this code, given all the GEMM-like calculations and the dense linear algebra we have, we believed that GPUs could really help us speed up even more. The approach we took used two programming models: one is CUDA C++ and the other is OpenACC.
The reason behind this choice is that we wanted to prototype some ideas pretty quickly using OpenACC, because it's a directive-based language, so it's very easy to code something up if you have an idea. And because the code is written in Fortran, if we wrote everything in CUDA there could be a lot of interfacing, even though in the CUDA branch we did end up doing all of this.
As for the choice of the CUDA C++ version for some kernels, we were hoping to fine-tune some of the kernels in places where OpenACC is not able to. That was the rationale behind that; and because we have so many GEMM, LAPACK, and FFT operations, we can also lean on the corresponding GPU libraries for those.
Some of the techniques we used to optimize this code and make it efficient include the non-blocking cyclic communication scheme, the use of CUDA streams, and a batched operation, or batching mechanism.
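As a minimal sketch of the CUDA-streams idea (hypothetical code, not the BerkeleyGW implementation), the point is to split work into chunks so that data transfers for one chunk overlap with compute for another:

```cpp
#include <cuda_runtime.h>

__global__ void scale(double* x, long n, double a)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// h must be pinned host memory (cudaMallocHost) for the async copies
// to actually overlap with the kernels.
void pipelined(double* h, double* d, long n, int nchunks)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    long chunk = n / nchunks;            // assume nchunks divides n
    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = s[c % 2];      // alternate between two streams
        long off = c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(double),
                        cudaMemcpyHostToDevice, st);
        int grid = (int)((chunk + 255) / 256);
        scale<<<grid, 256, 0, st>>>(d + off, chunk, 2.0);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(double),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```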
I will not go into the details, but I will touch on the communication scheme a bit, and also some optimizations we did for Sigma GPP. But before going into that, with this table here I would like to show you what kind of large scale we're talking about.
The benchmarks we use are silicon carbide supercells; the largest one has 2,742 atoms, and that's roughly 10,000 electrons. Some of the parameters you can see, highlighted in red, are really mind-bogglingly large. If we were not able to scale this up very efficiently, the runtime would be unimaginable; with this optimization, with this implementation, I'll show you later what we have actually managed to do.
So, the communication scheme. Here we're talking about really large-scale matrix multiplications.
If we have a matrix M like this, kind of fat and short, and multiply it by its transpose or conjugate transpose, we are trying to get this small chi matrix. For example, if we have four ranks, we would have four different copies of a quarter of this chi matrix, each rank calculating one copy, and then we accumulate all the copies together to get the final copy.
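In other words, each block of chi is essentially an M times M-conjugate-transpose product. A minimal cuBLAS sketch of one such block (my illustration with assumed names, not the BerkeleyGW source):

```cpp
#include <cublas_v2.h>
#include <cuComplex.h>

// Form one block chi = M * M^H with cuBLAS. M is "short and fat"
// (nrow small, ncol huge, column-major on the device), so the result
// is a small nrow x nrow block that each rank computes locally.
void chi_block(cublasHandle_t h,
               const cuDoubleComplex* dM,  // device, nrow x ncol
               cuDoubleComplex* dChi,      // device, nrow x nrow
               int nrow, int ncol)
{
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    // C = A * B^H, contracting over the long ncol dimension.
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_C,
                nrow, nrow, ncol,
                &one, dM, nrow, dM, nrow,
                &zero, dChi, nrow);
}
```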
The conventional way of doing this is really based on MPI collectives: each rank calculates its own copy, and then we reduce down to one copy after the calculation is done, and this goes on until we get to the last portion of the chi matrix.
There's still roughly the same amount of computation, the same number of blocks like this, in the new scheme. But if we look at a larger scale, this one is a really point-to-point-based communication, whereas the other is reduction and collective based, and how that reduction is implemented, whether it's efficient, could be another question. With this non-blocking point-to-point communication scheme, we were able to reduce the amount of communication we do and hide that communication behind the computation.
The pattern of how these different copies move around the network is a bit different than in the other scheme; this is why we call it a cyclic communication scheme. For example, for this particular quarter of chi, rank 1 would be calculating a copy, the blue copy, and this would be merged with the green copy as it moves along the network, and then it goes on, and then back to rank zero.
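A minimal sketch of that cyclic, non-blocking idea (my reconstruction with assumed names, not the actual BerkeleyGW code): each partial block travels around a ring of ranks, each rank merges in its local contribution, and the messages are posted with non-blocking calls so the next block's computation can proceed while they are in flight.

```cpp
#include <mpi.h>
#include <vector>

// On entry, acc holds this rank's local contribution to one chi block;
// after size-1 ring steps, every rank holds the fully merged block.
void cyclic_accumulate(std::vector<double>& acc, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int next = (rank + 1) % size;
    const int prev = (rank + size - 1) % size;
    const int n = (int)acc.size();
    std::vector<double> local(acc);   // keep our own contribution
    std::vector<double> incoming(n);

    for (int step = 0; step < size - 1; ++step) {
        MPI_Request reqs[2];
        // Post the communication first so it runs in the background ...
        MPI_Irecv(incoming.data(), n, MPI_DOUBLE, prev, 0, comm, &reqs[0]);
        MPI_Isend(acc.data(),      n, MPI_DOUBLE, next, 0, comm, &reqs[1]);
        // ... in the real scheme, the GEMM for the next block would run
        // here, hiding the communication behind computation.
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        // Merge our contribution into the block passing through.
        for (int i = 0; i < n; ++i)
            incoming[i] += local[i];
        acc.swap(incoming);
    }
}
```

This is essentially a ring-style reduction, which is why the total computation stays the same while the collective is replaced by point-to-point messages.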
So here rank zero would have the final result for this quarter. This has proven very helpful for performance, especially at small and medium scale. At an extremely large scale, we did notice that the overlap between the communication and the computation may not be as effective as we thought it would be, or at least as it was at the smaller scale, but for the most part it has been pretty effective and has been providing a lot of performance improvement. For the larger scale, we are still investigating what other things we could do to make sure the hiding is still effective. Anyway, I will show you some results later, including the scaling curves, and you can see the difference as the scale changes.
A
So
the
other
point
I'd
like
to
touch
on
is
the
reductions
in
sigma
gpp.
So
what
this
kernel
is
doing
is
basically
this
calculation
here
the
different
circles
represent
different
dimensions
that
we
have
to
collapse
over.
All these matrices are very large, and at the end of this kernel we're really trying to get a very small array, sometimes a three-by-one array.
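A minimal sketch of that kind of collapse (hypothetical code with made-up stand-ins for the actual terms, not the Sigma GPP source): many large dimensions are flattened and reduced down to a tiny output array.

```cpp
#include <cuda_runtime.h>

// Grid-stride reduction that collapses a huge (nband x ng) index space
// into a 3-element output, standing in for the "three-by-one array".
// Requires compute capability 6.0+ for double-precision atomicAdd;
// out must be zeroed before launch.
__global__ void collapse_reduce(const double* __restrict__ m,
                                long nband, long ng,
                                double* __restrict__ out /* size 3 */)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    const long total = nband * ng;
    for (long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;
         idx < total;
         idx += (long)gridDim.x * blockDim.x) {
        const double v = m[idx];
        s0 += v;          // stand-ins for the three energy-dependent
        s1 += v * v;      // terms accumulated in the real kernel
        s2 += v * v * v;
    }
    atomicAdd(&out[0], s0);
    atomicAdd(&out[1], s1);
    atomicAdd(&out[2], s2);
}
```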
[Session chair] Charlene, I think we're coming up close to the end of time. We have about one minute.
All right. So some of the optimizations we did include moving the kernels from a bandwidth-bound region to a compute-bound region, replacing some of the instructions, and also removing some of the excessive branching. We will talk about this at the upcoming hackathon in more detail; if you're interested, you can take a look.
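To illustrate the flavor of those instruction and branching optimizations (hypothetical device code, not taken from BerkeleyGW):

```cpp
// Before: a slow divide plus a data-dependent branch in the hot loop,
// which can diverge across a warp.
__device__ double term_before(double num, double den, double cutoff)
{
    if (den > cutoff)
        return num / den;
    return 0.0;
}

// After: multiply by a reciprocal hoisted out of the loop, and a 0/1
// mask instead of the branch (assumes den > 0 so rden stays finite).
__device__ double term_after(double num, double den,
                             double rden /* precomputed 1.0/den */,
                             double cutoff)
{
    const double mask = (den > cutoff) ? 1.0 : 0.0;  // predicated select
    return mask * num * rden;
}
```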
But in terms of results, we have a very good speedup compared to the CPU implementation for both Epsilon and Sigma.
Up to 20,000 GPUs, both Epsilon and Sigma scale very well; it's just the last few data points that we're still investigating. But overall it's been pretty impressive work, and I would say kudos to the whole team for getting all of this done. With that, I'd like to stop, and I'd like to acknowledge all the resources we have used. Thanks.