From YouTube: GPU Based Simulations with QMCPack
Description
Ye Luo (Argonne)
GPU Based Simulations with QMCPack
Hi everyone, I'm Ye Luo from Argonne National Laboratory. I work in the Computational Science Division and also at the Leadership Computing Facility. I'm happy to share with you my GPU experience with QMCPACK.
All this work is supported by the Exascale Computing Project; QMCPACK is part of its application development effort.
All those exascale systems actually share some commonality: they are all hybrid, GPU-accelerated architectures. When we design QMCPACK, we take this into account. On the other hand, we don't want to leave CPU users behind, so we try to address all the portability issues and also make sure the code performs well on CPUs.
That's why the system size you can solve, in terms of number of electrons, is relatively low. On the other side of the spectrum, the accuracy kind of drops, because those methods are more empirical and less involved with first-principles calculations, while the system size can be extremely large because of the cheap computational cost. QMC is actually in the middle: slightly more expensive than Density Functional Theory, which a lot of people, for example NERSC users, are familiar with.
As you can see, however, it is cheaper than typical quantum chemistry methods because of its appealing scaling, going as the third to the fourth power depending on what problem you are solving. Another advantage of quantum Monte Carlo is that it scales very well on a massive number of nodes in a supercomputer. In the past we could scale up to close to one million CPU cores; node counts have actually dropped on recent machines, but you can multiply the per-node factor on top of that.
So what problems does QMC solve? We are mostly aiming at materials, such as solids, which I did some studies on in the past, and we also did simulations of molecules. This example is actually not a tiny molecule: it's a metal-organic framework, a very humongous molecule with very complex structure.
All these things are of interest for science, and we want to tackle them with QMC. As we move from petascale to exascale systems, we have more compute power, which means we can either solve existing, typically petascale problems faster, or we get a chance to solve much larger problems, even ten times the number of electrons. Our aim is around 1,000 atoms and 10,000 electrons. That's the overall effort we are putting in to make those simulations happen for science.
Okay, so QMCPACK implements quantum Monte Carlo algorithms. It's a modern, high-performance, open-source simulation code. All the development is on GitHub, so you can see all the history, the discussions, the issues, and how people are thinking, and we welcome everyone to talk to us. The code treats, as I mentioned, solids, 2D systems, and molecules; it can be used on a wide variety of materials of interest for both physics and chemistry.
The whole code is in C++, and we adopt an MPI+X scheme, where X in the beginning was OpenMP or CUDA; right now it's even extended, so we are kind of combining both. If you look at the three words in "quantum Monte Carlo", you can figure out some patterns of parallelism. First, it's Monte Carlo, so you have massive numbers of Markov chains. The Markov chains can be parallelized, and this suits very well those supercomputers with a lot of nodes. This provides the high-level concurrency.
Second, we solve quantum problems; at that scale we deal with the interactions between particles, like the electron-electron interaction and the electron-ion interaction, and those add another level of concurrency we can use. Historically this worked very well on SIMD architectures on CPUs. Apart from explicitly coded kernels, QMCPACK also heavily relies on linear algebra libraries from vendors, plus LAPACK. Some additional libraries we use are HDF5 for I/O and FFTW for initialization; they are not performance-critical, I would say, but they are very important as well.
So I explained that "quantum" gives you concurrency and "Monte Carlo" gives you concurrency. Overall the algorithm evolves in this way; to give you a schematic of how it runs: you start with a bunch of electron configurations and you evolve them. We call each configuration a walker, and as a whole they form a population. Over time the particles do random walks, so the walkers evolve very independently.
So it's very good for parallelism. At the end you have to evaluate something to decide the weight of each walker, and you do load balancing. Overall the code weak-scales very easily, except for this load balancing, which you can't avoid if you intend to keep efficiency. So now you have a basic idea of how the parallelization happens inside QMCPACK.
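The walker/population idea described above can be sketched in a few lines of C++. This is a toy illustration, not QMCPACK's actual code: the `Walker` struct, the 1D coordinates, and the function names are all hypothetical, but the key property is real — each walker advances independently, which is what makes the method so parallel.

```cpp
#include <random>
#include <vector>

// Toy walker: one electron configuration plus its statistical weight.
struct Walker {
  std::vector<double> positions; // one coordinate per electron (toy model)
  double weight = 1.0;
};

// Advance every walker independently by one diffusion step.
// Walkers never talk to each other here, which is why this loop
// parallelizes so well across threads, GPUs, and MPI ranks.
inline void advance_population(std::vector<Walker>& population,
                               double step, unsigned seed) {
  std::mt19937 rng(seed);
  std::normal_distribution<double> gauss(0.0, step);
  for (auto& w : population)
    for (auto& x : w.positions)
      x += gauss(rng);
}

// After the independent moves, evaluate a population-wide quantity; a real
// QMC code would follow this with the load-balancing step mentioned above.
inline double total_weight(const std::vector<Walker>& population) {
  double sum = 0.0;
  for (const auto& w : population) sum += w.weight;
  return sum;
}
```

In an actual diffusion Monte Carlo run the weights would also drive branching (replicating or killing walkers), which is exactly where the load-balancing cost comes from.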
Now let's talk about the GPU porting; first a word before going into the details. As I mentioned, the walker parallelism is very scalable. In the past, on machines like Mira, which had a very efficient network, we could hit 99% weak-scaling efficiency, and on machines like Titan and Summit, whose networks are typically less performant I would say, QMCPACK could still retain 95% scaling efficiency. So no worry about cross-node communication.
The GPU porting then focuses only on a single node. To do GPU porting, you need to understand the parallelism in your code and map it to the hardware. In QMCPACK we designed a way to map walkers to different threads. But remember that we are doing Monte Carlo, and although at a high level Monte Carlo looks easily parallelizable, it has another challenge: its divergent behavior. You propose moves, and some moves get accepted.
Some moves get rejected, so the different walkers don't advance at the same pace, and you have to find a way to mitigate those penalties, by grouping all the accepts together or grouping all the rejects together; that kind of optimization. At a lower level, the electron-electron interaction gives data parallelism. This typically works very well; however, it easily hits the limit of the compute power of a single CPU core, so we have to develop ways to go beyond one core.
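The grouping trick just mentioned can be sketched as follows. This is a minimal illustration, not QMCPACK's implementation: given the accept/reject outcome for a batch of walkers, we partition the walker indices so that each outcome is processed as its own uniform batch, instead of branching per walker inside a GPU kernel (which would diverge).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container: indices of walkers whose proposed move was
// accepted vs. rejected, so each group can be updated in one uniform batch.
struct AcceptRejectBatches {
  std::vector<std::size_t> accepted;
  std::vector<std::size_t> rejected;
};

// Partition a batch of per-walker outcomes into the two uniform groups.
inline AcceptRejectBatches
group_by_outcome(const std::vector<bool>& accept_flags) {
  AcceptRejectBatches out;
  for (std::size_t i = 0; i < accept_flags.size(); ++i)
    (accept_flags[i] ? out.accepted : out.rejected).push_back(i);
  return out;
}
```

The "accepted" kernel (apply the move, update matrices) and the "rejected" kernel (restore state) then each run branch-free over their own index list.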
So when designing a performant code for GPUs, you have to factor in all the parallelism patterns and map them properly to your device, and you also need to care about what target problem you're solving. We wrote all those details up in a paper that is already out or coming out soon; there you can find how we designed QMCPACK's parallelism, how we mapped all the details, and how we achieved good efficiency across CPU and GPU.
The technology we chose for targeting GPUs is OpenMP offload, the "target" feature, because it is supposed to be portable across vendor GPUs. There is also a nice way to fall back on the CPU: you can offload to the CPU and do a lot of development, like correctness checks, that way. Those added benefits make it very developer-friendly.
You know, for QMCPACK we actually track not just performance: we track many compilers, both open-source ones and vendor-provided ones, but it's a lengthy process to make sure they improve in quality and meet the needs of QMCPACK. Later you will see that I mostly use Clang 15, which is the best option for QMCPACK right now; on the other hand, GCC is also in good shape, at least passing all the correctness checks.
AOMP is good for AMD but still lacks a bit of the optimization we'd like to see. With NVHPC we have to work around certain features that the compiler refuses to provide. OneAPI is very close to Clang and still needs improvement; the correctness checks are good, but performance tuning is still being worked on. So OpenMP is an important component in the software stack used by QMCPACK. At a higher level, QMCPACK also relies on CPU threads talking to the GPUs independently; that actually adds additional challenges for the compiler and runtime developers to ensure thread safety.
In addition to that, as I said, QMCPACK relies on linear algebra, so on each platform we have to talk to the corresponding vendor linear algebra libraries, such as cuBLAS and Intel MKL. We try to minimize the source code we have, and we rely on C++ templates to handle the real and complex cases and the full- and mixed-precision cases. We are very happy that C++ is evolving at a very stable pace, and we now rely on C++17.
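The single-template-source idea can be illustrated with a trivial kernel (the function is hypothetical, not from QMCPACK): one routine is instantiated for real and complex value types, and analogously for float versus double precision, instead of maintaining four hand-written copies.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// One generic dot product; T can be double, float,
// std::complex<double>, or std::complex<float>.
template <typename T>
T dot(const std::vector<T>& a, const std::vector<T>& b) {
  T sum{}; // value-initialized: 0 for reals, 0+0i for complex
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i] * b[i];
  return sum;
}
```

The same source then serves all four value-type/precision combinations, and each instantiation can forward to the matching vendor BLAS routine in a real code.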
Now, some performance numbers. This is actually a many-years effort, I would say. In each cluster of bars, the leftmost green bar is the CPU-only case. Then we started to port to GPUs, and you see in the light blue bars that initially the performance was really bad, because our 1.0 reference is QMCPACK's legacy CUDA implementation. That legacy GPU implementation, written explicitly in CUDA, is performant, but it's not maintainable, and certain design choices are not flexible enough to be portable. That's why we rewrote the GPU code using the OpenMP offload plus library approach.
Initially the performance was way behind the target we'd like to hit, but over time, as we ported more and worked with the Clang community to improve the performance of the compiler, that helped us a lot to pass the performance check, even exceeding the CUDA code in certain cases. The newly written code is also more feature-complete, which is very friendly to users; the legacy GPU implementation with CUDA is not feature-complete, and users frequently get stuck. And we keep improving the performance.
I think we are at the switching point, pushing users to use the new code right now.
We did most of the development on Summit, but right now, at NERSC and on Polaris, we have the A100 GPUs, so we were all curious how the code does with the newer generation of GPUs. On this figure you will see that, yes, there's no doubt about the GPU acceleration: you should use the GPU, and CPU-only you have a huge loss. But with the latest A100 GPU there's an additional benefit.
It's not only because of more compute power, but also because the memory gets larger. When QMCPACK uses orbitals represented as splines, they take a lot of memory space, and the remaining space is also needed for all the wavefunction components, where you store matrices. When you have a larger memory capacity, more data can be resident on the GPU, and thus the GPU is kept busier. So there's a huge benefit.
On the A100s, as the problem size grows large, we don't get bottlenecked by memory like on the V100, and you can see at the largest point, the 256-atom problem, the performance is close to 3x over the V100. So the A100 is more efficient and also has a larger memory space, which is very beneficial for the simulation; that's the advantage of the A100. Somehow this pattern is a bit like machine-learning applications, where larger memory really helps.
Okay, here are the lessons learned so far. We carefully assessed how CPU and GPU work efficiently, mapped the code's existing concurrency to those levels of parallelism, and maximized the efficiency of the hardware. We adopted our OpenMP-plus-vendor-library strategy: compared to the old legacy GPU implementation with CUDA, which needed around 100 kernels explicitly written in CUDA, we have now brought that number down to about 10 kernels plus about 10 offload regions.
So when we switch to a different hardware and software stack, like AMD's, we need to take care of the CUDA kernels, but all the OpenMP offloads are fully portable. So it's very maintainable, and with very decent performance, I would say. For the last part, I'll share my own opinion about GPU porting; hopefully you can get some insights. First, I think data movement is the key of GPU porting. It's the top priority.
It's very similar to the past experience. In the old days, when we only programmed with MPI, you knew that the interconnect was your bottleneck, and you tried to avoid moving data across the interconnect. Later, with CPUs, if you look at the details, you know that DDR is slower than the cache, so we tried to implement cache-friendly algorithms and keep the data resident in cache. Right now, for GPU porting, it's similar.
Data locality is your top priority when you design your algorithm. I personally think you should avoid any programming style that ignores the cost of moving data. Although some software technologies tell you that you can rely on unified memory and ask the runtime to move the data for you, I always say you'd better put in the effort to understand how data should be moved and keep explicit control. That's what I can recommend to most of you who are not software engineers.
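The "explicit control over data movement" advice can be sketched like this (an illustrative example, not QMCPACK code): an enclosing `target data` region keeps the array resident on the device across many kernel launches, paying one host-device transfer for the whole loop instead of letting a unified-memory runtime shuttle it every step. Without offload support the pragmas are ignored and the code runs identically on the host.

```cpp
#include <vector>

// Apply a simple update kernel to v many times, keeping v device-resident.
inline void relax(std::vector<double>& v, int steps) {
  double* p = v.data();
  const int n = static_cast<int>(v.size());
  // One host<->device round trip for the whole loop, not one per step.
  #pragma omp target data map(tofrom : p[0:n])
  for (int s = 0; s < steps; ++s) {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
      p[i] *= 0.5; // stand-in for a real update kernel
  }
}
```

The inner `target` regions find the data already present on the device, so the steps run back to back with no transfers in between.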
A
Another
point
is:
if
you
study
or
you
follow
those
GPU
lessons,
they
typically
tell
you
how
to
do
kernel
programming,
but
that's
not
the
key
I
would
say:
that's
really
secondary.
You
should
first
settle
down
how
the
data
been
moved
back
and
forth.
That's
more
of
your
focus.
A
Yeah
second
part,
is
you
need
to
understand
the
parallelism
of
your
code
and
parallelism
inside
the
hardware
you
are
using
and
only
when
you
map
them
very
well,
so
at
least
you
need
two
levels
on
symbols
on
CPU
and
GPU.
If
you
map
them
well,
you
will
likely
get
a
very
portable
code
across
CPN
GPU
and
never
worried
about
further
details
of
the
hardware.
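A sketch of mapping two levels of concurrency, in the spirit of the walker/particle split described earlier (names and the flat array layout are illustrative): the outer walker loop goes to GPU thread blocks via `teams distribute`, and the inner particle loop to the threads within a block via `parallel for`. On a CPU the same structure maps naturally to cores and SIMD lanes.

```cpp
// coords is a flattened [nwalkers][nparticles] array of coordinates.
inline void scale_all_walkers(double* coords, int nwalkers,
                              int nparticles, double factor) {
  // Outer level: one team (thread block) per walker.
  #pragma omp target teams distribute \
      map(tofrom : coords[0:nwalkers * nparticles])
  for (int w = 0; w < nwalkers; ++w) {
    // Inner level: team threads share the particles of one walker.
    #pragma omp parallel for
    for (int p = 0; p < nparticles; ++p)
      coords[w * nparticles + p] *= factor; // stand-in per-particle update
  }
}
```

Because the two loop levels mirror the hardware's two parallelism levels rather than any vendor-specific detail, the same source stays efficient across CPU and GPU.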
I think for scientists, CUDA is not the best choice.
From my experience, when I do OpenMP offload development, I spend most of the time just offloading to the CPU and getting the numbers right; then I go to a GPU machine to settle the final fine details about data movement and all those things, and to see if I made mistakes. So if you really need CUDA, restrict it to serving library calls: GEMM, linear algebra, all those things, try to rely on libraries. It's not a scientist's job to optimize those portions; leave them to the libraries.
People might struggle to choose between OpenACC and OpenMP. I think they are competitors and siblings, and there are pros and cons. I would say: assess your situation, your needs, and where you run the simulation; choose one, and use that one to restructure and refactor your code into the best shape. Moving to the other one afterwards should not be that difficult; it's not as simple as just switching pragmas, there's a bit more to it, but both should bring your code in the same direction.
That direction is what is helpful for GPU porting. Both programming models give you the capability of doing reductions, whereas with raw CUDA you have to do it on your own; that's one more reason it's appealing to use OpenACC and OpenMP.
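As an illustration of that built-in reduction support (a hypothetical function, not from QMCPACK): one `reduction` clause asks the runtime for a cross-thread sum, where low-level CUDA would require a hand-coded shared-memory reduction tree.

```cpp
// Sum an array of per-walker energies on the device (or host fallback);
// the reduction clause handles the cross-thread combination safely.
inline double sum_energies(const double* e, int n) {
  double total = 0.0;
  #pragma omp target teams distribute parallel for \
      map(to : e[0:n]) reduction(+ : total)
  for (int i = 0; i < n; ++i)
    total += e[i];
  return total;
}
```

OpenACC offers the equivalent with its own `reduction` clause, which is part of why both directive models are attractive to scientists.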
So, to conclude: I told you how we redesigned QMCPACK and enabled a new performance-portable implementation. Although this is our second time doing the GPU porting, we learned a lot, and we find that we are doing much better compared to the past.