From YouTube: Programming Hybrid HPC Systems: What is Old is New Again
Description
Viktor K. Decyk, UCLA
A: Viktor Decyk is a research physicist and adjunct professor at UCLA. He has been involved in using and developing particle-in-cell, or what everybody calls PIC (everybody knows what we mean when we say PIC), his entire career, beginning as a graduate student under John Dawson, one of the pioneers of the technique. He has been active in parallel computing since 1987.
B: All right, so this talk is really about programming, and about recognizing that a lot of what looks new has actually been around before.
What I'm going to talk about specifically, in terms of programming, are PIC codes; that's what I do. For those of you who aren't into plasma physics, I'll just say what this is: PIC codes integrate the trajectories of many particles interacting self-consistently through their electromagnetic fields. So they model plasmas at the most fundamental, microscopic level.
Simpler, reduced models are often very fast, but you have to be careful in their use, so PIC codes give you a test bed for verifying them. The largest calculation nowadays is about three trillion interacting particles, so they've gotten very big.
The first machine I used at NERSC, which was then the MFECC, was the A machine (a different A machine from another A machine that came later), which was the CDC 7600. It was designed by Seymour Cray before he had Cray machines, and it had an interesting feature: a memory hierarchy. There was a small core memory that was fast and a large core memory that was slow, and the user managed that himself. Well, what's interesting is that you see the same thing on a GPU.
During this time I have experienced two revolutions in supercomputing. The first was the introduction of vector computing, which came in the late 70s and early 80s, and we all had to learn how to vectorize our code. We had these nice little Crays, and I actually tried to get a picture of the Cray from NERSC off the web, but I couldn't, so I had to use a different one. Anyway, in 1984 the performance of my code was about eight microseconds per particle per time step.
I also went to Japan and used the Fujitsu, which was actually even better; the performance was about an order of magnitude beyond what we had before. So vectorization was cool, and because vectorization has made a comeback, let me talk about it in a little more detail. One place vector code shows up is in field solvers, because that is usually a fairly straightforward and regular problem, compared to the particles in PIC, which are going all over the place.
Vectorizable means that the elements of a vector can be processed in any order, which is also the definition of parallelizable. Another feature that was important on the Crays was that stride-one memory access was optimal; in other words, if adjacent elements of a vector were stored next to one another in memory, that was the best thing you could have. That was primarily to avoid bank conflicts, since the memories were organized into banks.
There was no cache there, and there were something like 64 banks. The worst case was a stride of 64, because then you'd hit the same memory bank every time, and you didn't want to do that. But anyway, here is code similar to what I actually had in those days, and this is a field solver.
It's actually solving Poisson's equation in Fourier space, and this is the inner loop over the y coordinate. If I just put this through the compiler, it worried about the case where these two arrays might actually be in the same location, so it wouldn't vectorize it. But the fix was easy: you added a little compiler directive, "ignore vector dependency", and then it said okay, I can go, and it would vectorize it. Code written like that was quite straightforward, and if you're an old-timer with lots of code like that, you're going to love GPUs, because that part is really pretty straightforward.
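As a rough illustration of that kind of loop (my reconstruction, not the slide), here is a Fortran sketch of a Fourier-space Poisson inner loop with an "ignore vector dependency" directive. The array names and normalization are made up, and the directive spelling varies by compiler (on the old Crays it was CDIR$ IVDEP).

      ! Sketch of a stride-one inner loop from a 2D Fourier-space Poisson
      ! solve, E(k) = -i*k*q(k)/k**2, at one fixed kx.  The directive tells
      ! the compiler to ignore the apparent dependency between q and fx/fy
      ! (in real codes they may be passed through overlapping arguments).
      subroutine poisson_inner(q, fx, fy, dkx, dky, nyh)
      implicit none
      integer :: nyh
      real :: dkx, dky
      complex :: q(nyh), fx(nyh), fy(nyh)
      integer :: k
      real :: dky1, at1
      complex :: zt
      !dir$ ivdep
      do k = 2, nyh
         dky1 = dky*real(k - 1)
         at1 = 1.0/(dkx*dkx + dky1*dky1)
         zt = -cmplx(0.0,1.0)*at1*q(k)
         fx(k) = dkx*zt
         fy(k) = dky1*zt
      enddo
      end subroutine poisson_inner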
I actually had an early start on the second revolution, parallel computing, because I worked at JPL and Caltech with their very early hypercubes, around 1986 or so. Geoffrey Fox was one of the people who built the first Cosmic Cube, which was a bunch of IBM PCs connected together. Some years later NERSC got their T3E.
They had a T3D earlier than that, which we also used; that was a loaner or something, I guess. Anyway, the speed of the code then went up to 13 nanoseconds per particle per time step, which is nearly a factor of 500 over what we had in the early Cray days. On the other hand, the Mac on my desktop was running at 2.4 microseconds.
So the conclusion is that the desktop machine becomes equal to a supercomputer about 10 years later, more or less. Not only that, but 3D now became practical; before this it was really very difficult to do 3D, because there wasn't enough memory. Our 3D code was running at around 74 nanoseconds, and we used this machine for the Numerical Tokamak Project back around 1994 or so, originally with the T3D, I think. MPI didn't even exist yet; we were still using PVM for the message passing and so on. Eventually things got standardized, but it took a while.
I'm going to spend a few slides on how this was done, because it's going to come back to haunt us here. The thing is that we're using distributed memory, which means the processors don't share any memory whatsoever. So what we do is partition the space among the processors; this is called domain decomposition. In this picture there are about six particles in this processor's region, and six in each of the others.
When a particle got to the edge of its region and moved into the next region, there was a piece of software that checked for that and moved the particle over to that processor. This worked really well. It's the scheme people basically still use today, and for the very largest calculations it scales to hundreds of thousands of processes, actually a million nowadays.
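As a minimal sketch of the bookkeeping behind that (illustrative, not the actual code): with a uniform one-dimensional partition, the destination processor of a particle follows directly from its position.

      ! Minimal sketch (not from the talk): with a uniform 1-D domain
      ! decomposition in y, each MPI rank owns ny/nvp grid cells, and the
      ! owner of a particle follows directly from its y position.
      integer function owner_rank(y, ny, nvp)
      implicit none
      real :: y        ! particle position in grid units, 0 <= y < ny
      integer :: ny    ! total number of grid cells in y
      integer :: nvp   ! number of MPI ranks (assumed to divide ny evenly)
      owner_rank = int(y)/(ny/nvp)
      end function owner_rank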
The other thing we did was to use what computer scientists sometimes call a bulk-synchronous model. We don't overlap communication and computation, for a couple of reasons. First, there was very little to gain, because the overlap occurs at the edges, which don't have a big impact; and second, it really messed up the modularity of the code and made it very difficult to maintain.
It also turned out that there were really only about four communication patterns that we needed for most of the code, so it was much easier than you might imagine. Computer scientists all complain about MPI being the assembly language of parallel computing, but in practice it wasn't that bad.
If it's a spectral code, we would change to a uniform partition for the FFT; otherwise we might move partitions just for dynamic load balancing. There's also the transpose that the FFT uses. Those are the primary communication patterns, and we just reused them over and over again.
Okay, just one last thing, to show you some real code: this is from the particle manager using MPI. Basically, particles which leave the domain are buffered and then sent to and received from the neighboring processors. This is the send buffer.
There's a loop that handles up and down, and moving out and back in, with the same subroutine. You send these messages, and they're asynchronous; when they arrive you can find out how many particles got there from the get-count call. That was more or less the guts of the thing: you buffer the particles, send and receive them, and then the receive buffer gets unpacked and the particles are inserted back into the particle array.
One really fun bug was leaving this call out. This is waiting for the send to complete, and you might think you could leave it out, because for other reasons you already know the send was sent and you don't need to wait. That's a bad mistake, and it leads to a really nasty bug where the code runs for a hundred thousand time steps and then crashes. The reason is that the wait actually has a side effect.
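Here is a heavily simplified sketch of that pattern (my reconstruction, with illustrative names and only one direction shown): outgoing particles are packed into a send buffer, exchanged with non-blocking MPI calls, counted with MPI_GET_COUNT, and the send is completed with MPI_WAIT, the call whose omission causes the bug just described.

      ! Simplified sketch of one direction of the particle manager (names
      ! and buffer layout are illustrative, not the code on the slide).
      subroutine pmove_right(sbufr, rbufl, nsr, nrl, idimp, nbmax, &
                             kright, kleft)
      implicit none
      include 'mpif.h'
      integer :: nsr, nrl, idimp, nbmax, kright, kleft
      real :: sbufr(idimp,nbmax), rbufl(idimp,nbmax)
      integer :: msid, ssid, ierr, ncount
      integer :: istatus(MPI_STATUS_SIZE)
      ! post the receive for particles arriving from the left neighbor
      call MPI_IRECV(rbufl, idimp*nbmax, MPI_REAL, kleft, 1, &
                     MPI_COMM_WORLD, msid, ierr)
      ! send the nsr particles that are leaving toward the right neighbor
      call MPI_ISEND(sbufr, idimp*nsr, MPI_REAL, kright, 1, &
                     MPI_COMM_WORLD, ssid, ierr)
      ! wait for the receive, then ask how many particles actually arrived
      call MPI_WAIT(msid, istatus, ierr)
      call MPI_GET_COUNT(istatus, MPI_REAL, ncount, ierr)
      nrl = ncount/idimp
      ! the "fun bug": this wait looks redundant, but leaving it out leaks
      ! the request and the run eventually crashes after many time steps
      call MPI_WAIT(ssid, istatus, ierr)
      end subroutine pmove_right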
Another type of algorithm that was used on these parallel machines was blocking, or tiling. The reason is that they typically used RISC processors, which had caches; the Crays didn't have caches. The idea, basically, is that you want to process data in small chunks that fit into the fastest memory you have: you put a chunk into fast memory, calculate with it, and don't go back to it again. It's also useful for memory management, which is one of the things that's important to think about.
As we get to higher and higher performance machines, memory becomes more and more of a bottleneck, and you have to pay more attention to it. One thing we sometimes do is transpose the data. The data here is stored in Fortran order.
This access, however, is terrible, because it has a big stride, namely ny. You might think that wouldn't matter so much on a RISC processor; on the Cray a bad stride was a disaster, and here it's not quite so bad, but the reason it still matters is that these systems bring data in by cache lines. They're not just bringing in one piece of data, they bring in a whole cache line, and if the data is adjacent you get all the rest of the data for free.
So you stage the block through the fast memory, because that memory is really fast. This way you get stride one coming in and stride one going out, and the badly strided accesses happen only in the fast memory, where they don't cost much. This is what matrix transpose libraries use, and there are lots of algorithms like this; many of them are hidden from users, because they're in some library you didn't know was there.
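As a hedged illustration of the blocking idea (my own sketch, not a library routine): transpose a matrix one small tile at a time, so that both the reads and the writes to main memory stay within a few cache lines.

      ! Sketch of a cache-blocked (tiled) transpose, b = transpose(a).
      ! The nb-by-nb tile is chosen small enough to fit in the fast memory,
      ! so the badly strided accesses always hit data that is already there.
      subroutine blocked_transpose(a, b, nx, ny, nb)
      implicit none
      integer :: nx, ny, nb
      real :: a(nx,ny), b(ny,nx)
      integer :: i, j, ii, jj
      do jj = 1, ny, nb
         do ii = 1, nx, nb
            do j = jj, min(jj+nb-1, ny)
               do i = ii, min(ii+nb-1, nx)
                  b(j,i) = a(i,j)
               enddo
            enddo
         enddo
      enddo
      end subroutine blocked_transpose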
Okay, how's my time? Okay, we're good. So now I believe we're in the middle of a third revolution, in both hardware and software, and this revolution is driven by a number of things. Moore's law for frequency is pretty much gone, but Moore's law in the sense of lots of transistors is still going strong, only now it shows up as increased parallelism. There's also a requirement for low-power computing, and part of this revolution is driven by that. At least part of my goal is that kind of computing; the other part is to empower ordinary people, who don't have access to NERSC, to do really cool stuff too.
Okay, so the two leading contenders are NVIDIA GPUs and the Intel Phi, formerly known as Knights Corner. The GPU has a bit of a head start, so I'm going to start with that one because it came first. GPUs are graphics processing units, originally invented for graphics, for gaming; they became programmable in 2007.
People have told me that they didn't do this for the scientists; they did it because the gaming people wanted to control special effects themselves, and the hardware didn't provide that fast enough. So perhaps they didn't do it for us, but nonetheless we can take advantage of it. A GPU consists of, at least at the high end, about 12 to 30 SIMD multiprocessors. SIMD stands for single instruction, multiple data, which means it applies the same instruction to a whole bunch of numbers, not just one.
They're not built the same way as the Cray vector units; the Crays were really pipelined architectures. But from a software perspective we couldn't care less, because they look the same to a software developer. So you have these multiple vector processors, each with a small, fast memory, typically 16 or 48 kilobytes, plus a slow global memory. Now, "slow" should be in parentheses, because it's about an order of magnitude faster than the memory on your host machine.
So it's still pretty fast, and it's readable by everybody. One unique thing about NVIDIA in particular is that it has very fast, lightweight threads that can be switched in one clock period, and they use this feature in a number of ways; I'll talk about that a little later. They also have very high bandwidth, around 100 to 200 gigabytes per second, but only if the access is ordered. In other words, the hardware brings in a cache line at a time, so you have to take advantage of that. Secondly, it can handle thousands of threads simultaneously, and you can use this to essentially optimize the memory access.
In effect, what happens is this: suppose it takes 200 clocks to read memory. You can have 200 threads lined up, one after another; you issue one read in one clock period, go to the next thread and issue its read, go to the next, and basically pipeline the memory. It's a way where, if you have enough threads, you can almost completely hide the memory access time. The Intel MIC is actually very similar.
It was introduced in 2012; Knights Corner was around earlier, but this is now a commercial product. It has 50 to 60 processing units, about twice what the NVIDIA I was using has, but the vector length is smaller, and they do have caches.
They also have high bandwidth, in fact higher than the NVIDIA, about 300 gigabytes per second, but they can only handle four threads per core, so they can't keep as many threads outstanding as the NVIDIA. So they're similar, but different.
So basically what I'm arguing is that these machines are all similar enough that we can program to a sort of abstract model of what they are, and not worry about the actual hardware. This is my abstract picture: a node consists of a bunch of vector processors, each with a fast memory that only it has access to, and they are connected through a slower global memory; we can then connect nodes with MPI on top of that as well. Each vector unit works in lockstep, and it has local synchronization.
There is no global synchronization on this hardware, although AMD GPUs do allow it. So within one of these nodes there are two levels of parallelism that you have to program to: one is the vector, and the other is the fact that you've got a bunch of these multiprocessors running simultaneously in parallel. I don't necessarily use the languages the vendors provide; in fact, the different vendors have different ways of programming it anyway. So this model applies to a wide variety of processors.
For an NVIDIA GPU, the vector length here is actually a virtual vector length, something they call a block size. It can be larger than the hardware's: they might run 128 threads together even though the hardware only has 32, and the block size can be even bigger than that, but 128 is typically a good number. The number of vector processors and the size of the fast memory are about what I showed. The Intel Phi is similar, except the vector length is half and the number of processors is about twice.
Even an ordinary multicore host fits this picture: maybe not 16 cores on your laptop, but on the host machine there might be 4 or 16 of them. We often don't program their vector units explicitly, but they're there, and they have the same structure. Actually, even the Cray X-MP back in the day had a vector length of 64 and two to four vector processors, though no separate fast memory.
So it's a very general model, and that's what we're going to program towards. The other piece of fairly good news is that, although the pieces are put together in a different way, all the pieces are things we already know. We have vector algorithms, which, as I said, are calculations whose elements can be done in any order, and they have a long history on the Crays.
Blocking and tiling are well known, and then there's domain decomposition, partitioning the memory, so that one SIMD multiprocessor doesn't know about another one's memory; we have a long history with MPI with that kind of code. So all the pieces are actually familiar, but they're put together in a new way.
The other thing that's really surprising, and this really blew my mind, is that GPUs require you to use many more threads than there are physical processors. For somebody who has programmed MPI, vector, or OpenMP, the question is: why would you want to do that? You'd never do that with OpenMP; you wouldn't have more OpenMP threads than you have processors.
Or more MPI ranks than nodes. Well, there are two really important reasons. One is that it hides memory latency; it allows you to pipeline memory, and that's important. The other, even more unexpected, is that you can use a master-slave model for automatic load balancing. Suppose you have one core and a hundred threads, each a little sub-piece of work, and some take more time and some take less.
In the old days with OpenMP or MPI you'd have to worry about where to put the domain boundaries to make sure the work was balanced. But here, if one guy finishes early he just gets more work, and one who finishes late doesn't, and eventually it all statistically balances out. This is really a kind of master-slave model, but we had never thought of it that way. And now NVIDIA actually has a certain amount of this in the hardware.
They're even adding the ability to hand work from one multiprocessor to another, so you get even more of this. But you can also do it manually; if you know your own code, you can manage this kind of thing yourself, and not only on the GPU, but also if you're writing in another language.
You can do that because, by having these many small parallel pieces of work, you have a lot of flexibility about how you hand them out. That was unanticipated to me, and a genuinely new feature. The other thing is that optimizing data movement is even more critical than it was before. It's always been important, and now it's absolutely critical that you deal with it.
So let's look a little bit at some of these algorithms. Here's the field solver that I showed before, now from the GPU code. You can see there are only three lines of code that changed. One is that the loop index, which used to be a do-loop index, is now essentially this thread expression, and that is analogous to your node number in MPI.
Everybody asks "what's my node number?", except that the node number is two-dimensional in CUDA; but that's basically your node ID. And then, because the number of threads actually running may not be an exact multiple of the loop count, there's an if statement to skip the extra threads.
So it's really easy. Like I said, if you have a lot of dusty code from the Cray days, you're going to love the GPU.
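A sketch of what that change looks like in CUDA Fortran (my reconstruction, with made-up names): the do-loop index becomes a thread and block index, and an if statement guards against the launch size not being an exact multiple of the loop count.

      module field_gpu
      use cudafor
      implicit none
      contains
      ! Sketch: a field-solve loop as a CUDA Fortran kernel.  The do-loop
      ! index is computed from the thread and block IDs, and the if test
      ! guards against the number of threads launched not being an exact
      ! multiple of the loop count.
      attributes(global) subroutine poisson1_kernel(q, fk, dk, nyh)
      integer, value :: nyh
      real, value :: dk
      complex, device :: q(*), fk(*)
      integer :: k
      real :: wk
      k = threadIdx%x + blockDim%x*(blockIdx%x - 1)
      if ((k > 1) .and. (k <= nyh)) then
         wk = dk*real(k - 1)
         fk(k) = -cmplx(0.0,1.0)*q(k)/wk   ! E(k) = -i*k*q(k)/k**2
      endif
      end subroutine poisson1_kernel
      end module field_gpu

The host side would launch it with something like call poisson1_kernel<<<(nyh+63)/64,64>>>(q_d, fk_d, dk, nyh), where q_d and fk_d are device arrays.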
Here's a blocking and tiling algorithm; this is used in the transpose. The transpose on CUDA is even worse than on the host, because it reads data in big blocks, even bigger than cache lines,
so the penalty for doing it the straightforward way is even bigger. Here's the original Fortran, and this is CUDA Fortran; by the way, you can do CUDA in Fortran. You declare a shared memory array, which acts as scratch temporary memory, and the loop indices now become these thread IDs. You copy the data in, and the if statement replaces the loop bounds (I should have highlighted that). Because many threads are working, you have to have a synchronization point: the threads don't all finish writing at the same time, and you want to make sure nobody starts reading before everybody has had a chance to write. Then the loop parameters are flipped, and the index that was threadIdx%x on the way in becomes threadIdx%y on the way out.
That's the equivalent of how you flip these things, and so on; it's a fairly straightforward copy into this CUDA dialect. One other interesting thing: you'll notice that the buffer really only needs nblock by nblock data, but I made the leading dimension one bigger. That's an old Cray trick to avoid memory bank conflicts, because the shared memory here has the same kind of memory bank conflicts that the Cray had. So here's another old story coming back, and the same old trick works: you make the dimension not a power of two, 65 instead of 64, and most of the bank conflicts disappear. Another new idea that's really an old idea.
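A rough CUDA Fortran sketch of that tiled transpose (again a reconstruction, not the slide): the tile is staged through shared memory, syncthreads separates the writes from the reads, and the leading dimension of the shared tile is padded by one, the 65-instead-of-64 trick, to avoid shared-memory bank conflicts.

      module transpose_gpu
      use cudafor
      implicit none
      integer, parameter :: NBLK = 16
      contains
      ! Sketch of a tiled transpose, b = transpose(a), on the GPU.  Each
      ! thread block moves one NBLK x NBLK tile through shared memory so
      ! that both the global reads and the global writes are stride one.
      attributes(global) subroutine gpu_transpose(a, b, nx, ny)
      integer, value :: nx, ny
      real, device :: a(nx,ny), b(ny,nx)
      ! leading dimension padded by one: the old trick against bank conflicts
      real, shared :: tile(NBLK+1,NBLK)
      integer :: i, j, it, jt
      i = threadIdx%x + NBLK*(blockIdx%x - 1)
      j = threadIdx%y + NBLK*(blockIdx%y - 1)
      if ((i <= nx) .and. (j <= ny)) tile(threadIdx%x,threadIdx%y) = a(i,j)
      ! make sure the whole tile is written before anyone reads from it
      call syncthreads()
      it = threadIdx%y + NBLK*(blockIdx%x - 1)
      jt = threadIdx%x + NBLK*(blockIdx%y - 1)
      if ((it <= nx) .and. (jt <= ny)) b(jt,it) = tile(threadIdx%y,threadIdx%x)
      end subroutine gpu_transpose
      end module transpose_gpu

It would be launched with a 16 by 16 thread block and a grid big enough to cover the whole matrix.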
I guess you don't have the font I used on your version of Keynote; I had a special font there. Okay, so unfortunately this is the only equation in my talk. What I've been talking about so far is fairly general; now I'm going to talk a little more specifically about the PIC code that we have and what we've done for the GPU in more detail.
Here is a cartoon of a PIC code. It really has three steps: there's a Poisson equation, you deposit a charge density, and then you advance the particles after you solve for the electric field. The field solve is the easiest part; the rest is basically interpolation.
If a particle is here, you distribute its charge to the four nearest grid points when you deposit; and when you get to the acceleration, you interpolate the electric field to the particle from those same four nearest grid points. For the push there's no problem, but for the deposit, if another thread is working on a nearby particle, the two can share a grid point in common, and they could be writing to the same grid point at the same time.
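For concreteness, here is roughly what that four-point (bilinear) deposit looks like as plain serial Fortran, an illustrative sketch that ignores boundary conditions and the collision issue just described.

      ! Sketch of a 2D charge deposit with bilinear (area) weighting: each
      ! particle of charge qm is shared among its four nearest grid points.
      subroutine deposit2(part, q, qm, np, idimp, nx, ny)
      implicit none
      integer :: np, idimp, nx, ny
      real :: qm
      real :: part(idimp,np)   ! part(1,i) = x, part(2,i) = y, in grid units
      real :: q(nx+1,ny+1)     ! charge density, including one guard point
      integer :: i, n, m
      real :: dx, dy
      do i = 1, np
         n = int(part(1,i))
         m = int(part(2,i))
         dx = part(1,i) - real(n)
         dy = part(2,i) - real(m)
         q(n+1,m+1) = q(n+1,m+1) + qm*(1.0-dx)*(1.0-dy)
         q(n+2,m+1) = q(n+2,m+1) + qm*dx*(1.0-dy)
         q(n+1,m+2) = q(n+1,m+2) + qm*(1.0-dx)*dy
         q(n+2,m+2) = q(n+2,m+2) + qm*dx*dy
      enddo
      end subroutine deposit2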
So that's a problem. For GPUs it became clear early on that this matters, because PIC codes have low computational intensity: even though the memory bandwidth is an order of magnitude higher than on the host, the processors are even faster, since it's a teraflop machine. PIC codes really are a streaming algorithm: the particle data and the field data should move in, you process them, you move them out, and you never read them again, just one pass through. The only way I could think of to do that is for the particles to be ordered, so that all the particles within one little tile are processed at the same time. If that were true, then things would really be easy.
And it did work, so we think we're in pretty good shape now. You divide space into these tiles; in this case I'm drawing two-by-three cells in each tile, and all the particles in a tile are stored together.
The way this works is that each tile is processed in parallel and independently. When you process a tile, you first copy the fields from global memory to fast memory, and that is the one and only read of the grid for that step. When you deposit, you also write back to global memory; when you push, you don't. All these tasks can be done in parallel.
There are some guard cells that aren't in the picture, but basically the push is trivial; it's very similar to the MPI code, except that the partitions are tiny. For the deposit you have this possible data collision. It turns out it's easy if each tile is controlled by just one thread, so it depends on how you assign threads to tiles: if you give one thread to each tile, there are no data hazards and everything works. But if you want better performance, you can use a vector of threads on one tile, and then data collisions are possible. That only works if the system supports atomic updates. An atomic update means that an update, say adding something to a memory location, is not interruptible; it completes before anyone else tries to add to it.
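A minimal CUDA Fortran sketch of that idea (illustrative only, one tile shown, guard cells between tiles omitted): a block of threads deposits into a shared-memory charge array, and atomicadd makes the concurrent updates safe.

      module deposit_gpu
      use cudafor
      implicit none
      contains
      ! Sketch: a 1D charge deposit where one thread block works on one
      ! tile.  Two particles can hit the same grid point, so the updates
      ! to the shared-memory charge array go through atomicadd.
      attributes(global) subroutine deposit_tile(part, q, qm, npp, nxt)
      integer, value :: npp, nxt
      real, value :: qm
      real, device :: part(*), q(*)
      real, shared :: sq(129)          ! tile charge, assumes nxt <= 128
      integer :: i, n
      real :: dx, old
      ! zero the shared tile cooperatively
      i = threadIdx%x
      do while (i <= nxt+1)
         sq(i) = 0.0
         i = i + blockDim%x
      enddo
      call syncthreads()
      ! each thread handles a strided subset of the tile's particles
      i = threadIdx%x
      do while (i <= npp)
         n = int(part(i))
         dx = part(i) - real(n)
         old = atomicadd(sq(n+1), qm*(1.0-dx))
         old = atomicadd(sq(n+2), qm*dx)
         i = i + blockDim%x
      enddo
      call syncthreads()
      ! add the tile back into global memory
      i = threadIdx%x
      do while (i <= nxt+1)
         q(i) = q(i) + sq(i)
         i = i + blockDim%x
      enddo
      end subroutine deposit_tile
      end module deposit_gpu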
For the reordering, the way we did it was this: we first create a list of the particles that are leaving a tile; then, using that list, each thread places its outgoing particles into an ordered particle buffer; then there's a synchronization, and then you read from that buffer. Let me draw a picture. Here's a tile, and there are particles going out in eight different directions: direction one, two, three, and so on.
All the particles going in direction one are stored together in the buffer, the ones going in direction two are stored next to them, and so on, so it's an ordered buffer. With the MPI code we had a left buffer and a right buffer; here we do all the directions simultaneously. There isn't a loop over directions, they're all done at once. Then, when you go to read them, every tile knows it needs data from eight different directions, and it knows exactly where to go to get it, so you don't have to search or anything like that. This all runs in parallel: all the tiles run in parallel, and all the directions are done in parallel.
So that was the hard part.
Here are some benchmarks from this code. We've also implemented an OpenMP version: if you throw away the vectorization and just write a serial version of the tile update, then the loop over tiles becomes the OpenMP loop, and you can run it on a traditional architecture.
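A sketch of that OpenMP structure (with illustrative names, and a trivial free-streaming push standing in for the real one): the outer loop over tiles is the parallel loop, and each tile's particle list is processed serially inside it.

      ! Sketch: in the OpenMP version the outer loop over tiles becomes the
      ! parallel loop, and each tile's particle list is processed serially.
      ! The push body here is just a free-streaming placeholder, x = x + v*dt.
      subroutine push_tiles(ppart, kpic, ntiles, nppmx, idimp, dt)
      implicit none
      integer :: ntiles, nppmx, idimp
      real :: dt
      real :: ppart(idimp,nppmx,ntiles)  ! particles stored tile by tile
      integer :: kpic(ntiles)            ! number of particles in each tile
      integer :: k, j
!$omp parallel do private(j)
      do k = 1, ntiles
         do j = 1, kpic(k)
            ppart(1,j,k) = ppart(1,j,k) + dt*ppart(2,j,k)
         enddo
      enddo
!$omp end parallel do
      end subroutine push_tiles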
Here's the original code. This is 2D, the same problem I did in 1975 more or less, though with a bigger problem size. It takes about 31 nanoseconds per particle per time step on one core of an i7, and 800 picoseconds with OpenMP on the 12 cores we have on our Dawson system.
Compared with the T3E we had 16 years earlier, the cost per particle push is now about 7 times 10 to the minus 8 cents instead of 30 cents, so it has gone down nine orders of magnitude. That's good, and the GPU card fits in your desktop.
This problem, 150 million particles running on one GPU, is similar in size to the Numerical Tokamak Project runs we did on that T3E, where 100 million particles was the biggest we did.
As for speedup, one of these was about thirty times faster than a single core, and this one is about fifty. I'm going to skip over the details of the multiple-GPU work, because I'm getting low on time.
A graduate student in our group has taken OSIRIS, which is our workhorse PIC code for laser-plasma interactions and plasma accelerators, and implemented this algorithm in it. The code I've been describing is actually a skeleton code, a bare-bones code that I use for testing, evaluating, and developing for new architectures; he took those algorithms, put them into OSIRIS, and he's now running on up to 225 GPUs. This is weak scaling, and it's very, very good.
I mean, it's almost perfect weak scaling across these GPUs. The problem he's running has 22 billion particles, on our Dawson cluster, which has about 300 GPUs, and a time step takes about a tenth of a second.
Strong scaling, on the other hand, is very problematic, because with GPUs you have something like 500,000 cores and you've got to keep them all busy or it's worthless. When you do strong scaling you're reducing the problem size per GPU, and it doesn't take long before the problem is just too small for the GPU to do anything with, because you can't keep all the threads busy. So it's hard to do strong scaling; you're really pushed into doing weak scaling, and your problem size determines how many GPUs you're going to use.
The goal is to provide and document parallel PIC codes and make them open source. I don't have time to really go through the activities this center is funded to provide, but the key thing is that the components and codes, both skeleton codes and otherwise, will be available under open-source licenses. It's intended to be a community resource for PIC, for anybody, and we welcome contributions.
This activity is just getting going; we don't actually have a website for it yet. But I do have a website with some of these skeleton codes, and those have been available for a few months now. They are intended to be teaching codes: to teach people what a PIC code is, and also to teach people the parallel techniques. There will be versions with three levels of parallelism coming, and we're interested in other languages and in collaborating with people. For example, how are you going to express three levels of parallelism? Is there a language out there that can do that? We'd be interested in collaborating with people to look at those issues. All the codes I talked about today, except for OSIRIS, are available on that website; you can download them. The CUDA C code is there, and the CUDA Fortran will be there probably in a week or two.
That includes vector techniques, blocking and tiling, and message passing. One thing I didn't mention much is data-parallel techniques, which is basically vectorization with array syntax. There wasn't much history at NERSC with this, but I used it on the Thinking Machines Connection Machine, where it was well developed, and HPF Fortran also had that kind of thing. Those are actually fairly useful, and I use some of those techniques as well, though I didn't talk about them much. So, programming to this hardware abstraction leads to common algorithms.
By the way, the OpenMP version also keeps the particles ordered. One of the things we sometimes want in PIC codes is to add additional operators, like collision operators, and that has traditionally been expensive, because you have to find where your neighbors are. Now it's free: the neighbors are right there, and you know where they are, because everything is in these micro-tiles.
Yes, at the very beginning of the code. The code is initialized on the host, and then we just have a routine that checks each particle, one by one, to see which tile it belongs in, and moves it into that tile. But that's only done once, at t equals zero. Is that what you're asking about?
Oh, for this one, well, that's a physics question. These skeleton codes are just uniform, but the OSIRIS code has different ways of handling non-uniformity. The way I do it personally is with a root-finding scheme: I have a distribution function, and I look for the points where the integrated density reaches the next integer, because with the way I normalize the density, the total integral is the total number of particles.
So whenever the integral reaches the next whole number, that's where the next particle goes. OSIRIS has a different way: they weight the charge roughly inversely to the density, so that's a different way of doing it too, and there you just do an integral. But that's all part of the initialization; none of that is done on the GPU, it's all done on the host.
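A sketch of that kind of loading (my own illustration, not the skeleton-code or OSIRIS routine): integrate a 1D density profile and place a particle wherever the running integral crosses the next whole number.

      ! Sketch: place np particles according to a 1D density profile dens(x)
      ! normalized so its integral over [0,nx] is np.  A particle is placed
      ! wherever the running integral crosses the next whole number.
      subroutine load_profile(x, np, nx, nbins)
      implicit none
      integer :: np, nx, nbins
      real :: x(np)
      integer :: i, j
      real :: dx, s, snew, tnext
      dx = real(nx)/real(nbins)
      s = 0.0
      j = 1
      tnext = real(j)
      do i = 1, nbins
         snew = s + dens(dx*(real(i)-0.5))*dx
         do while ((tnext <= snew) .and. (j <= np))
            ! linear interpolation inside the bin for the crossing point
            x(j) = dx*(real(i-1) + (tnext - s)/(snew - s))
            j = j + 1
            tnext = real(j)
         enddo
         s = snew
      enddo
      contains
         ! illustrative profile: uniform, np/nx particles per cell
         real function dens(xx)
         real :: xx
         dens = real(np)/real(nx)
         end function dens
      end subroutine load_profile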
A: So you put a lot of work into your skeleton code, as you described it, and then you took the results of that, or somebody took them, and integrated them into OSIRIS. Would you characterize the amount of work needed to bring OSIRIS onto these architectures as easy, difficult, major, minor?
B: Well, I think medium. Adam also added some additional optimizations that I didn't have, because he's really gung-ho about this stuff; I mean, he read through the generated assembly, he really got into it. But in general the hard part was the sorter and the reordering, and you can use that as a black box.
I don't think you need to know how it works. The push and the deposit are pretty straightforward, because they're largely the same old code, except that there's another layer: an outer loop over tiles and an inner loop over the number of particles in the tile, plus a temporary copy to a local array. Otherwise it's the same algorithm. So I would say, at least on one GPU,
it should be fairly straightforward, and on my website I have a little diff-style listing that shows what I changed to go from one version to another, so you can look at that. There are little details that are kind of annoying: on the GPU, every subroutine becomes two subroutines, because there's a host subroutine that launches the kernel, and what used to be inside your loop now lives in a different place.
So there are these annoying things you have to worry about, and it will take a while, but I don't think either the push or the deposit has anything terribly difficult. There are some new things, like atomic updates, but that's just a function call, and so on. The hard part was the reordering, and I hope you never have to look at that.
D: My question is: you've done all that work with CUDA. What do you think about the directive approaches like OpenACC? Do you think they're going to be mature enough that you could use them?
B: Yeah, they're not there yet, certainly not now. Frank Tsung is actually here; yes, he is.
He's actually been playing with OpenACC, and it just looks like it doesn't support everything that you need to control. It's kind of immature; would you say that's fair?
E: So, Victor, on PIC codes: once you've done that and you run it through the compiler, the compiler is just going to tell you whether it can do it, but at that point you're stuck, because it doesn't provide hints about how to get the compiler to understand what you're trying to accomplish and optimize it for the GPU.
B: We also tried this on the Intel Phi to see how that worked, and the parallelization part worked great. It didn't vectorize anything out of the box, though, so clearly that's the thing to work on for the Phi.
But if the structure of the code is correct, it will work on the Phi; the parallel part does work. The vectorization does nothing at the moment, but we know it can vectorize, because CUDA has vectorized it, so we know it's possible.
I don't really care who the winner is in terms of the actual vendor; I think we can make it work, because I think they're all going to look like this. You can see that the SIMD units have to be there because of energy; it's just inevitable that the hardware is going to look something like that, I just don't know exactly how. And the languages are catching up, but they're not there yet. With MPI we eventually got a standard; here it's still a moving target. I think the abstractions for how you express two levels of parallelism are still not totally there. It's new, and people haven't been doing it for a long time.
C: So you kind of have a limit on the amount of parallelism you can exploit with this sorting algorithm, which is just the number of grid points. Does the physics justify simply increasing the number of grid points if you end up with more hardware cores?
B: Right, exactly. The tile size is clearly arbitrary, and when I did those benchmarks I just scanned through and found the size that was optimal. There are actually two competing effects on GPUs. One is that the bigger the tiles, the more shared memory you need, and the GPUs are trying to run multiple independent blocks simultaneously; they call this occupancy.
So the more shared memory you request, the lower the occupancy you get. On the other hand, the bigger the tile, the faster the reordering goes, because very few particles leave a tile. So there's typically a peak somewhere in between, where the occupancy starts to go down but the sorting is still cheap.
It's typically 16 by 16 that we're using most of the time, though that probably wiggles a bit. If you're using one thread per tile, like we did on the pre-Fermi architecture, then the tiles are more like two by three, because with only one thread responsible for a tile you quickly run out of shared memory.