From YouTube: VASP Workshop at NERSC: Parallelization
Description
Presented by Martijn Marsman, University of Vienna
Published on December 18, 2016
Presented at the 3-day VASP workshop at NERSC, November 9-11, 2016
Does everybody know what an MPI rank is? When you use MPI, you start independent processes that we call MPI ranks, and these essentially run copies of the program that communicate with each other. Inside the program there are explicit instructions to divide the work and the data, and there are explicit points where these processes hook up and exchange information.
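For illustration, starting VASP with eight MPI ranks might look like the line below; the launcher and executable name (vasp_std) are assumptions that depend on the installation:

    mpirun -np 8 vasp_std    # starts 8 independent processes (MPI ranks),
                             # each running a copy of the program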
So those are the MPI ranks: the individual processes that you start. If we look at how the work is parallelized, how it is explicitly distributed over these MPI ranks, then at the highest level of parallelization, which is actually an optional level, we distribute over the k-points that we have been speaking about.
Why is this the highest level? It is the highest level in the sense that, if we have 8 MPI ranks and we want to divide the work over these ranks, then at the highest level we create out of these 8 MPI ranks, for instance, two groups of four ranks. In that sense it is the highest level, because it is the first division of our total number of MPI ranks, but it is an optional one.
At this highest level, if you want to distribute your work over k-points, you use the tag KPAR and set it to something other than one. If you set it, for instance, to two, you create two subgroups of MPI ranks: the first group works on the first k-point, the second group on the second, the first group on the third, and so on and so forth. So the work on these k-points gets distributed.
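As a minimal INCAR sketch of this, assuming a run with 8 MPI ranks as in the example above:

    KPAR = 2    ! split the 8 ranks into 2 groups of 4 ranks each
                ! group 1 works on k-points 1, 3, 5, ...; group 2 on k-points 2, 4, 6, ...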
The data, however, is not distributed. The data structures were not set up to distribute the data over these groups, so the data is replicated: the work is distributed, and the work on a particular k-point gets done by a particular group, but each group holds all the information for all k-points. That is obviously a bit of a problematic strategy. The problem is that redesigning data structures is always much more work than doing something like this.
So this is the way it was implemented; it was added later on. But replicating the data over these groups of course means that your memory demands will rise, especially if you have a lot of k-points and create a lot of these groups. Let's, for instance, imagine ourselves to be on a single node that has a certain amount of memory.
If I now start a number of ranks on that node and use KPAR to divide the work over k-points among these MPI ranks, each and every one of these ranks will allocate the full amount of memory for the wave functions; that memory does not get distributed over k-points. So your memory demands increase linearly with KPAR.
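A rough back-of-the-envelope version of that statement (the numbers are purely illustrative): if the wave functions for all k-points together take W bytes and you run on N ranks, then

    KPAR = 1  ->  about W / N per rank       (fully distributed)
    KPAR = 4  ->  about 4 * W / N per rank   (each group of N/4 ranks replicates all of W)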
So why is this nice to use, if you have the memory? It is nice because it is such a high level of parallelization that there is hardly any communication involved. Many of the operations that the program does depend only on the k-point itself, so each group can work on its own k-points, and only on very few occasions does it need information from the other groups that work on other k-points. That is very nice.
[Audience question]

Once you have the charge density, for DFT at least, there would be really no need to communicate anymore, because the Kohn-Sham equations at a certain k-point couple to the corresponding equations at another k-point only through the density. So if you have the density and keep it fixed, there would be almost no need to communicate at all.
[Audience question]

Actually they do synchronize, but only at very, very limited instances; they do update the information of the other groups, yes, that is true. So what does this mean? It means, for instance, that if one group works on the wave functions at a certain k-point and another group works on another one, then they synchronize the wave functions, and each group can use exactly the same algorithm to compute the density.
Each group then has all the information it needs to do that. I am not saying there is no other way to do this, or no better way to do this, but this was the quickest way to do it, and it was added more or less as an afterthought.
So why is this not the first strategy that we followed? Because if you go to large systems, you will normally have fewer and fewer k-points. K-point parallelization lends itself to very efficient parallelization, but under many circumstances it doesn't bring a lot, because if you have a very large system you may have only one k-point, and then you cannot parallelize over k-points at all.
So I am not saying this is in any way a perfect strategy, but that is the way it is, and that is the highest level: the first division that we make of our MPI ranks. The default level of parallelization is over the orbitals, the Kohn-Sham orbitals, and that happens within these groups. So let's assume, for the sake of simplicity, that we have only one group, KPAR = 1.
That is one way you could distribute your data. Where would this run into limitations? Imagine that you have a very large cell and you put one molecule in it, or maybe only an atom: a free atom in a very large cell. For this atom only a very limited number of wave functions would be computed, maybe four, but each of these functions would have a very large number of plane-wave coefficients, because you have a very large cell.
So here our MPI ranks are assigned to work on particular bands, but we can combine this with a parallelization over plane-wave coefficients, where I say: now, for instance, two MPI ranks together are responsible for my first wave function, the next two MPI ranks are responsible for the second wave function, and so on.
Yes, so that would be our lowest level of parallelization, and you can make any combination of these two levels; that is controlled by the tag NCORE. I don't know why I wrote it that way — oh yes, because there is another tag that controls the same thing, but NCORE is the one I would advise you to use, because I find the meaning of that tag the easiest to explain; the other one does exactly the same, but in a very obscure way.
Okay, so that has a few consequences, and I try to illustrate them here. It's not so important, but I would like to elucidate what I mentioned before. So this would be rank number one and rank number two — I have two of them — and NCORE is one; in that situation my first wave function resides entirely on rank one.
Then we go to a situation with a complete redistribution of the data — as I said, that is done internally, you don't have to care about it — where now MPI rank one holds half of each of these functions and the second rank holds the other half. That makes these kinds of evaluations much easier, because they are done on a plane-wave-by-plane-wave basis, so each rank can now be responsible for part of this matrix.
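A minimal INCAR sketch of the two layouts just described, for the case of two MPI ranks in total (purely illustrative):

    NCORE = 1   ! each orbital lives on one rank: band 1 on rank 1, band 2 on rank 2
    NCORE = 2   ! both ranks share every orbital: rank 1 holds half of the plane-wave
                ! coefficients of each band, rank 2 holds the other half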
To work on those, we do parallel FFTs, so the ranks work together to do the Fourier transforms of the bands, which is obviously necessary because they share the coefficients; and to do such things, those contractions, there is again some communication involved.
[Audience question]

Regarding the data layout here: we store only the components within the cutoff sphere. For the serial FFTs they get put onto the mesh and then there is simply a standard 3D FFT. In parallel it is done differently: the parallel FFT is an in-house implementation where you transform first along one direction, so there we do an actual sphere-to-cube FFT.
Yes, so some considerations: what could you say at the outset about wise choices for these parallelization parameters? Well, KPAR: if you have a lot of k-points, or even a few k-points, and you can afford the memory, then it is a very good thing to use — always use it to its fullest.
If you have large functions, if you have a lot of plane waves per one-electron function, definitely use NCORE to distribute that work over MPI ranks, but do not set it to more than the number of physical cores that you have on one socket, on one package.
What does this mean? The next generation of hardware is going to be slightly different, but many of the Haswell nodes consist of two sockets, each with a number of physical cores — let's say ten on socket one and ten on socket two. Those cores have very fast access to a certain block of memory; they probably also have access to the memory of the other socket, but that takes slightly more time.
So, for instance, for these FFTs you would not want to force the cores that reside on one socket to read and work with the memory of the other socket; that would not be wise from a performance point of view. So you would not set NCORE to anything larger than the number of cores per socket. Good.
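In INCAR terms, for the example node above with ten physical cores per socket (the core count is just the example used in the talk), that advice amounts to something like:

    NCORE = 10   ! no larger than the number of physical cores on one socket,
                 ! so the ranks that share an orbital also share fast local memory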
[Audience question]

NCORE and NPAR control the same thing — well, not exactly the same, but NPAR was defined in a rather idiotic way, and it is almost impossible to describe. NCORE simply tells you: I have NCORE ranks that work on one function. The other tag is more or less the inverse of it: NPAR tells you how many bands I do in parallel. So if I have ten ranks and I say NPAR is 10, I do ten bands in parallel.
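As a small worked example of how the two tags relate, assuming 10 MPI ranks in a single k-point group (the relation NCORE * NPAR = ranks per group is how I read the explanation above):

    NPAR  = 10   ->  10 bands in parallel, 1 rank per band         (NCORE = 1)
    NCORE = 2    ->  2 ranks share each band, 5 bands in parallel  (NPAR = 5)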
[Audience question]

No, it's not very prominent in our manual, because our manual was really lagging behind by quite a bit, which is bad, yes. And you can now use this level of parallelization over plane-wave coefficients in connection with hybrid functionals as well; that didn't use to be the case, but that has been resolved. So this applies to our current release version, which is parallelized purely using MPI.
The version that you have actually been working with here, and that will be available to users at NERSC, is a new version that we are now beta testing. It uses a combination of OpenMP and MPI, and, roughly speaking, the lowest level of parallelization has been replaced by OpenMP. As I said, in that version, if you use more than one thread, you should not set NCORE to anything other than one.
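Conceptually, running that hybrid beta version might look like the following; the launcher, executable name, and thread count are assumptions that depend on the machine, and the point is only that NCORE stays at one once more than one thread is used:

    export OMP_NUM_THREADS=4    # 4 OpenMP threads per MPI rank (the lowest level)
    mpirun -np 8 vasp_std       # 8 MPI ranks x 4 threads = 32 cores in total
    # and in the INCAR:
    # NCORE = 1                 # keep NCORE = 1 when using more than one thread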
[Audience question]

However, where we go through these Green's functions, we do not go via these four-electron quantities. That would scale again like N to the power of four: if we computed our polarizabilities using these four-electron quantities, we would not reach the cubic scaling.
[Audience question]

No, no, this does not affect your memory requirements: whether NCORE equals one or anything other than one, it does not increase your storage demands. In that sense these two levels, which were the standard levels of parallelization before the k-point level was added, have the data fully distributed over the MPI ranks; that is what this tries to express.
A
B
A
G
A
G
A
The
same
the
same
underlying
idea
is,
is
mentioned
here
right,
so
that
you
that
you
wouldn't
sorry
that
you
wouldn't
increase
this
beyond.
The
number
of
course
were
per
socket.
You
wouldn't
force
communication
between
processes
that
that
do
not
share
as
the
dart
that
are
not
as
close
to
to
a
certain
portion
of
memory
as
other
ones.
Right.
[Audience question]

There are areas where we use OpenMP as an additional level of parallelization — hybrid functionals are one example — but that is not the essential point; I didn't mention it before. I said OpenMP was a replacement for a certain MPI layer of parallelization, and in the areas where you replace MPI parallelization with OpenMP parallelization, you do not save any memory, because the data was already distributed.
There are places where we have used OpenMP to introduce additional layers of parallelization, and there, of course, the nice thing is that, first of all, you do not have to redesign your data structures, because OpenMP uses a shared-memory model, and for the same reason it does not increase the amount of memory that is used. But we already had quite a lot of stuff parallelized under MPI.
So that is not the prime argument in our case, because the data was already cleanly distributed, and where it was not, we have not introduced OpenMP. Where did we not distribute the data? At the very highest level, in the k-point parallelization, and there we do not rely on OpenMP for an additional layer of parallelization.
What we wanted to avoid is having MPI communication inside of OpenMP parallel regions. Actually, we wanted to introduce OpenMP to reduce the number of MPI ranks, not to increase them in the sense that I open a parallel region, create ten threads, and they all start to communicate through MPI. That was the idea behind putting the OpenMP parallelization at the very lowest level.
Is there a magic number? No, unfortunately not, because it depends on the system size. Because OpenMP acts mostly at this level, the parallelization over plane-wave coefficients: if you have a lot of one-electron functions that are in themselves very small, then it would not bring a lot to use many threads here.
Neither would it bring a lot to use many MPI ranks there. If you have a lot of data, you can distribute the work over it; if you have only a small amount of data, be it bands or coefficients, the overhead will start to hurt if you distribute it over too many processes. And that, of course, depends on the particular calculation you are doing: how many one-electron functions you have, and how many coefficients you have per function.
[Audience question]

I think the part of the code that is still almost entirely MPI-parallelized is the part where we do linear response to magnetic fields, the NMR part. For the rest, I can't think of any part that has not been parallelized using OpenMP as well. I mean, it is still a beta version, so it will still be worked on, and it might still change.