From YouTube: 7 Case Study 2: VASP (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
Okay, so this is supposedly a talk about how VASP got optimized and runs well on KNL, but I think most of the optimizations were done by the developers. Recently we had a paper that actually summarizes all of the relevant work, and there is another one as well. But I think, why don't I just present, as the NERSC participation, what we have achieved; I think that would be another case study as well. So what I want to present in this case study is the code optimization that was actually done by the developers; their effort started a long time ago, probably several years ago.
The major effort is that they added OpenMP directives to the pure MPI code. The currently official release of the code is MPI-only, but this hybrid code is under beta testing, and I think it will be released very soon.
So fortunately the important optimization work has been done. NERSC actually participated in some of it, like exploring the MCDRAM performance impact on the code. We helped the developers through the dungeon session with Intel, which happened in 2015, I think. Another thing I think we helped with is the beta testing, and also exploring the execution-space parameters to get the code to perform better. So this is the current performance it achieved.
The blue bars are the pure MPI code and the red bars are the hybrid MPI+OpenMP code, the optimized code. If we look at the second bar group, the one with the star, that is the best performance we get now; compared to the pure MPI code, it is about a two to three times speedup. So there is a big performance gain here, but that would not have been possible if we just depended on what was handed over from Intel.
Actually, when we started to look at the performance on Cori, we saw numbers about 30% to 200% lower than what Intel provided us. But then we did some performance tests, and I think over that time period the system also improved. In any case, after this execution-space exploration, I think we brought the performance on Cori KNL up.
I was debating between talking about the optimization and the performance. Anyway, I don't have time, so I will just present roughly at a high level. The code optimization is essentially adding OpenMP directives to the pure MPI code, and the major challenge there is how to leverage the three levels of parallelism inside the code.
Basically, MPI is the top level, the second level is multi-threading using OpenMP, and the next level is SIMD vectorization. That is the major challenge they have addressed. I want to skip the details today, but basically they put a lot of effort into SIMD-vectorizing the code, so the code is really well vectorized.
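To make that concrete, here is a minimal sketch, not from the slides, of how those three levels typically map onto a Cori KNL run; the binary name vasp_hybrid and the specific rank and thread counts are illustrative assumptions:

```bash
#!/bin/bash
# Level 1: MPI ranks      -> set with srun -n
# Level 2: OpenMP threads -> set with OMP_NUM_THREADS
# Level 3: SIMD           -> fixed at compile time (e.g. -xMIC-AVX512 on KNL)

export OMP_NUM_THREADS=8        # threads per MPI rank (level 2)
export OMP_PROC_BIND=spread     # spread threads over the rank's cores
export OMP_PLACES=threads

# 8 ranks per node (level 1); srun -c counts logical CPUs (4 per KNL core),
# so 8 cores per rank * 4 hardware threads = -c 32
srun -n 8 -c 32 --cpu-bind=cores ./vasp_hybrid
```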
The part I want to present is what we helped with as the NERSC effort.
Basically, we collected six different test cases and used them to test the performance. Our intention is to cover the typical workloads at NERSC and also to serve as benchmarks, so we tried to combine all the different considerations together.
The reason we do this is that KNL provides a lot more flexible execution options at run time, so we tried to explore all the possible combinations of the execution-space parameters. At the same time, since this is a new code, we are familiar with the pure MPI-only code but not with this hybrid code, so we first did the thread scaling, and those workloads are actually from the production workload.
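As an illustration of what such a parameter sweep can look like, here is a hypothetical sketch of the thread-scaling part, holding the cores used per node fixed while trading MPI ranks for OpenMP threads; the binary name and output naming are made up:

```bash
#!/bin/bash
# Thread-scaling sweep on one KNL node: 64 physical cores in use,
# ranks * threads held constant while the split varies.

CORES=64
for t in 1 2 4 8 16 32; do
  export OMP_NUM_THREADS=$t
  ranks=$(( CORES / t ))                 # MPI ranks per node at this point
  # -c is in logical CPUs: threads per rank * 4 hardware threads per core
  srun -N 1 -n $ranks -c $(( t * 4 )) --cpu-bind=cores \
       ./vasp_hybrid > scaling_${ranks}ranks_${t}threads.out
done
```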
So it is not addressing the hero runs, you know, the big runs; these are just medium-sized ones, and we tested up to 16 nodes. This is one of the test cases, a really typical DFT calculation case, and we look at the thread scaling and compare with the Haswell result.
A
So,
basically,
what
we
see
is
the
the
code,
the
hybrid
ACOTA
skills
up
to
four
to
eight
threads
in
most
of
the
runs
different
in
all
the
runs,
and
then
also
we
compare
with
the
Haswell
result
and
we
can
see
around
the
like.
For
knows,
they
have
a
similar
performance,
but
to
this
one
after
that,
then
we
see
has
well
is
better,
but
in
another
code
path.
This
is
a
relatively
more
memory
intensive
workload.
This is something called a hybrid functional calculation, and we see that as the node count increases, Haswell eventually catches up, but the performance difference is much larger here. Thread scaling is a good strategy here, up to eight threads. Actually, I didn't include this in the presentation, but in the paper we also looked into the same system.
We looked at what the scaling looks like if we use 32 threads per task and go even further out in node count, and it still scales. For adding OpenMP, our goal, and I mean for many codes the goal, is basically to match the pure MPI performance.
It is just not easy: on Haswell, or a multi-core system like it, it is hard to expect the hybrid code to beat the pure MPI code. But in this case I think this code does pretty well with OpenMP; there is a good, decent amount of speedup there.
Those are similar. Okay, maybe I will just mention this: these spikes are a really interesting one. We are not too sure what happened there, but these spikes happened only when running the code with one thread per task, and for the previous case we actually saw other spikes as well.
That one happens at 16 threads. Recently we looked into why we got those kinds of numbers, and basically it looks like some sort of interaction with the huge pages that were used. Anyway, only in these two cases did we see a negative performance impact from using huge pages; when we did these tests, our recommendation was actually to use huge pages all the time. I think this is partially some interaction with the system
configuration, I guess, and it really needs further investigation. But anyway, we first investigated the thread parallel scaling, and I will skip all of that. Then we also looked at the memory modes. Basically, the NERSC recommendation was to use either the quad,flat mode or the quad,cache mode, so our tests started from there, and the test results are very similar. This is what we get with VASP, with some spikes, and those spikes
A
Unfortunately,
they
are
reproducible
and
it's
not
like
a
random
thing
and
we
indeed
needed
to
investigate.
So
basically,
our
conclusion
is
all
these
are
test
cases.
Actually
they
fit
in
the
MACD
MCD
run.
Well,
so
it's
not
like
a
big
case,
so
we
see
that
as
long
as
it
the
case
can
fit
into
the
M
CD
ROM
memory.
we didn't see a performance difference between flat and cache mode, so our recommendation would be just to use the cache mode.
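In Slurm terms on Cori, the memory mode is picked through the job's node constraint; here is a hedged sketch, with the srun line and binary name illustrative:

```bash
#!/bin/bash
#SBATCH -C knl,quad,cache   # recommended: MCDRAM as a transparent cache
# For the quad,flat comparison runs the constraint would instead be
#   #SBATCH -C knl,quad,flat
# and in flat mode MCDRAM appears as NUMA node 1, requested with e.g.
#   srun ... numactl --preferred=1 ./vasp_hybrid

srun -n 8 -c 32 --cpu-bind=cores ./vasp_hybrid
```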
We also looked into hyper-threading, and basically our conclusion, I just want to keep this short, is that this is a very typical hyper-threading effect for DFT codes: when you use a small node count, hyper-threading helps a little bit, but in the later portion, the parallel scaling region, hyper-threading doesn't help and actually slows the code down.
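A hedged sketch of how one keeps a single thread per physical core on KNL (the counts are illustrative; each KNL core has four hardware threads, and srun's -c is counted in logical CPUs):

```bash
#!/bin/bash
export OMP_NUM_THREADS=4      # 4 OpenMP threads per rank
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# 16 ranks x 4 threads = 64 physical cores, one thread per core:
# -c 16 reserves 4 cores * 4 hardware threads per rank, and binding
# to cores keeps the extra hyper-threads idle.
srun -n 16 -c 16 --cpu-bind=cores ./vasp_hybrid
```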
So at NERSC we always recommend users to just not use hyper-threading. Huge pages are another thing we looked into, and basically in all the cases we observed in our tests, performance with huge pages is similar or better; we didn't see a slowdown, except for the one I mentioned at the beginning. I guess that is some sort of interaction with system issues, so we need to investigate.
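On a Cray system like Cori, huge pages are enabled through a craype module; this is a minimal sketch, assuming a 2 MB page size and an illustrative rebuild step, since the binary must be linked with the module loaded and the same module loaded again at run time:

```bash
#!/bin/bash
module load craype-hugepages2M   # load before linking the binary
make                             # hypothetical rebuild/relink step

# ... later, in the job script:
module load craype-hugepages2M
srun -n 16 -c 16 --cpu-bind=cores ./vasp_hybrid
```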
But anyway, that is our huge-page test. Another very important thing is that we looked at compilers and libraries. Like other DFT codes, VASP is heavily dependent on library performance, especially MKL, and in particular the linear algebra and FFT routines; they account for a major portion of the execution time.
So it is very important to look into this part. We compiled the code with different compilers, using different combinations of libraries and compilers, and we noticed that the best one is Intel plus MKL for the linear algebra, with the FFT routines from MKL through the FFTW interface that MKL provides. Basically, we can see the best one is somewhere here, the blue bar and the green bar over here.
So that is basically it: here the Cray result is the blue bar and the green bar is over here, so the best one is Intel plus the MKL library, and that combination was the best one. This is a similar one.
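For reference, a hedged sketch of what that build combination can look like on a Cray system; the exact makefile lines vary by VASP version, and the flags shown are illustrative of the Intel + MKL + MKL-FFTW-wrapper combination rather than taken from the talk:

```bash
#!/bin/bash
module swap PrgEnv-cray PrgEnv-intel   # use the Intel compiler (if not default)

# Typical link ingredients for Intel + MKL:
#   -qopenmp                    # the hybrid code needs OpenMP
#   -mkl=cluster                # MKL BLAS/LAPACK/ScaLAPACK
#   -I${MKLROOT}/include/fftw   # MKL's FFTW3 wrapper headers for the FFTs
ftn -qopenmp -mkl=cluster -I${MKLROOT}/include/fftw -o vasp_hybrid *.o
```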
So we did come up with some best-practice tips for our users. We are pretty confident that by around July, when Cori enters production and starts charging (it has actually already entered production, so we then start to charge the time on Cori KNL),
the code can run efficiently. So we came up with some good practices for users. Basically, how many threads to use is one important recommendation for them: four threads or eight threads is a good choice in all situations. Some of our users also compile the code by themselves; although we provide precompiled binaries, for various reasons they would like to have their own, so we recommend the Intel compiler and MKL if they want to do that.
In our case we have 68 physical cores on a node, but for VASP our recommendation is to use 64 cores. This is not just because it is easy to divide the threads and tasks; it is actually because the FFTs perform better with power-of-two core counts.
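Pulling those tips together, here is a hedged example job script for Cori KNL under those recommendations; the node count and the vasp_hybrid binary name are illustrative:

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH -C knl,quad,cache        # cache mode (flat showed no benefit for cases that fit)

module load craype-hugepages2M   # huge pages were neutral or helpful in our tests
export OMP_NUM_THREADS=4         # 4-8 threads per rank scaled best
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# 16 ranks/node x 4 threads = 64 of the 68 cores (a power of two, good for FFT)
srun -n 32 -c 16 --cpu-bind=cores ./vasp_hybrid
```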
So that is all I have for this case study.