From YouTube: ExaFEL project port to GPU
Description
Johannes Blaschke of LBNL presents a talk on the ExaFEL project's port to GPU. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Muaaz Awan
Thank you very much. Yeah, so I will essentially be following on from what Nick Sauter presented just before lunch, and I want to start by thanking all of these collaborators. They are all members of Nick's group at LBL (except Hugo, but Hugo also worked extensively on this GPU port), and I think it's only fair to acknowledge all the people who've done the actual work. The technique that Nick presented just before lunch is crystallography, and let me put it in the most basic physical terms.
Imagine you have a crystal, in the sense that it is a lattice of scatterers. When an X-ray beam comes in, the coherent X-rays will scatter off these scatterers arranged in a regular lattice, some physics happens, and then on a detector you will see bright peaks with dark regions in between. Here's a sketch for a one-dimensional array of scatterers: you would see these localized peaks, with maybe a little bit of structure in between.
In fact, we can be specific about what kind of physics happens there. It's all wave optics, really: you can think of each of these scatterers as a source of coherent light, in this case X-ray light. Depending on which position I choose, on which pixel I choose on my detector, I will get either destructive interference or constructive interference at that position from all the waves emanating from all the scatterers. And this means that, because we understand wave physics very well, we can actually simulate the intensity that we would see at each one of these pixels.
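As a concrete illustration (standard wave optics, not an equation taken from the slides): treat each scatterer at position r_n as a coherent point source; the amplitude at a pixel is the phased sum over all scatterers, and the recorded intensity is its squared modulus.

```latex
% Coherent superposition from N scatterers (standard wave optics, my gloss).
% q is the scattering vector, f_n the scattering strength of scatterer n.
A(\mathbf{q}) = \sum_{n=1}^{N} f_n \, e^{\,i\,\mathbf{q}\cdot\mathbf{r}_n},
\qquad
I(\mathbf{q}) \propto \lvert A(\mathbf{q}) \rvert^{2}
```

For a regular one-dimensional lattice of identical scatterers this sum collapses to the familiar grating pattern $\left[\sin(Nqa/2)/\sin(qa/2)\right]^{2}$: sharp peaks where all terms add in phase, and near darkness in between, exactly as in the sketch.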
That intensity is given by the equation shown here, and the important thing about that equation is that it combines the structure of the individual scatterers with the arrangement of the scatterers themselves, and these together lead to the different intensities of this interference. Then we have some properties of the incoming wavelengths as well, and on top of that we'll see some random noise interacting with the individual pixels, so there's a kind of background to all of this.
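In kinematic diffraction theory the equation factorizes along exactly those lines; the following is a hedged reconstruction on my part, not read off the slide.

```latex
% Hedged reconstruction of the slide's equation: pixel intensity as the
% product of a unit-cell structure factor F_cell (the individual scatterer),
% a lattice factor F_latt (the arrangement), and beam properties J, plus a
% background term B for the random noise.
I(\mathbf{q}) \;\propto\; J \,
\lvert F_{\text{cell}}(\mathbf{q}) \rvert^{2}\,
\lvert F_{\text{latt}}(\mathbf{q}) \rvert^{2}
\;+\; B(\mathbf{q})
```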
That background needs to be filtered out. And why are we interested in this sort of thing? Well, the previous slide shows what is, so to speak, the forward direction of this kind of simulation: if we start off with a crystal and the parameters associated with a beam, we can produce interference patterns. But the much more exciting problem is to flip it around.
So, essentially, we could have a bunch of diffraction patterns (these are real diffraction patterns), and what we want to know is: what are the crystals responsible for producing that kind of pattern? In fact, we can't just disentangle the crystals from the properties of the beam themselves. And so the vision of all of this is:
We will take a massive number of these images, we will cram them all into Perlmutter, and out comes some interesting science: some unknown structures that we are interested in. The way we achieve this using CCTBX is essentially a hierarchy of different codes (you can take a look at CCTBX here), and the overall structure of these simulations is roughly the same. At the top, at the user-facing level, we have Python that acts as the glue code.
The Python layer orchestrates the data analysis, so you might have something like this: we loop over a bunch of parameters, and for each parameter we do some simulations and then some data I/O. But we don't really want to stay in Python altogether.
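A minimal sketch of what such a driver loop might look like (the function and file names are hypothetical placeholders, not CCTBX API):

```python
# Hypothetical sketch of the Python "glue" layer: loop over parameters,
# run a forward simulation for each, then handle the data I/O.
import json

def simulate_spots(params):
    # Placeholder for the compiled forward-simulation backend;
    # returns a dummy "image" so the sketch runs standalone.
    return [0.0] * 4

parameter_sets = [{"detector_distance_mm": d} for d in (100, 150, 200)]

for i, params in enumerate(parameter_sets):
    image = simulate_spots(params)              # expensive compute (C++/CUDA)
    with open(f"pattern_{i}.json", "w") as f:   # data I/O back in Python
        json.dump({"params": params, "image": image}, f)
```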
We use Boost.Python as an API to expose a C++ backend to Python, and this C++ backend orchestrates all the I/O, the data structures, and the logic that you might need for crystallography. As part of our port to GPUs, we have started to take the most expensive components out of the C++ backend, and we've started writing CUDA implementations for these selected functions.
These CUDA implementations are actually fairly straightforward for the forward simulations of bright spots, because our problem is fairly easily parallelized: the individual pixels don't interact with one another, at least not in the forward simulations.
So, where your original C++ loop might have looped over pixels, what we do is assign each CUDA thread a set of pixels and just iterate over them in a strided fashion, handling each pixel independently of what the other pixels are doing.
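The real kernels are CUDA C++ behind the Boost.Python layer; purely to illustrate the same grid-stride pattern, here is a sketch in Python with Numba's CUDA target (my example, not ExaFEL code):

```python
# Grid-stride loop: each CUDA thread handles a strided set of pixels,
# each pixel independent of all others. Illustrative, not ExaFEL code.
import numpy as np
from numba import cuda

@cuda.jit
def per_pixel_kernel(pixels, out):
    start = cuda.grid(1)       # this thread's global index
    stride = cuda.gridsize(1)  # total number of threads in the grid
    for i in range(start, pixels.size, stride):
        out[i] = pixels[i] * 2.0  # stand-in for the per-pixel physics

pixels = np.random.rand(1 << 20).astype(np.float32)
out = np.zeros_like(pixels)
per_pixel_kernel[128, 256](pixels, out)  # (blocks, threads per block)
```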
So, for instance, we might want to resolve some per-pixel details. The detectors themselves have some physics associated with how they absorb X-rays, so we can loop over the detector thickness: essentially, the thicker the sensor, the more easily it absorbs a photon. The photons can also come from different sources and at different angles, and the crystals don't have a perfectly regular structure either: there may be different mosaic domains, so we might need to iterate over those domains as well.
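The thickness dependence is ordinary Beer-Lambert absorption (my gloss, standard detector physics rather than an equation from the talk): a sensor of thickness t and attenuation length ℓ captures a photon with probability

```latex
% Beer-Lambert capture probability for a sensor of thickness t and
% attenuation length l: thicker sensors absorb photons more readily.
P_{\text{abs}}(t) = 1 - e^{-t/\ell}
```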
Finally, once we have computed this on the GPU: we haven't ported the background and the noise to the GPU yet, so on the CPU we add this random noise. Let's just see what the end result of such a nanoBragg simulation is. NanoBragg, by the way, refers to this forward simulation. What nanoBragg does is:
It will take some parameters, some details of the crystal that you're simulating and of the wavelengths and the beam, and it will produce Bragg spots. There are several things I want you to be aware of here. Bragg spots are fairly localized; Nick did show that they have structure. In this simulation I've actually zoomed into the center, and you can see they're sort of fuzzy blobs.
They are not points in the true mathematical sense, you can see that they all have varying intensities, and there is a background here. Now, to get a handle on the inverse problem, I thought it might be instructive to look at what happens when I simulate changing the crystal's position, or its distance from the detector.
These are videos here, for example. You might be able to see some noise in here, which is the simulated noise, and also some aliasing, because as the crystal moves, its interactions with the pixels will be different, so I might see some aliasing effects. And this case would be fairly simple to invert: you look at a diffraction pattern, and
if you know what kind of crystal you're looking at, then you can just look at the positions of the pixels themselves and say: well, it was at that distance from the detector. So let's look at a more complicated situation: let's rotate the crystal about the z-axis. We see something much more, you know, trippy happening here. We're really only rotating the crystal, yet it creates a completely different-looking pattern.
Determining the orientation of the crystal might already be a harder ask. Anyway, getting back to the GPU port of nanoBragg: our first objective (and this actually happened before I joined the project) was to accelerate these forward simulations, that is, to start with some known parameters, simulate the bright spots, and accelerate that code. Here you can see a comparison between the performance of CUDA on a V100 and a single thread on a Skylake CPU, and essentially what that work has done is this:
If you really cared about noise (and that might be the next target, but let's ignore it for now), you can see that we have accelerated the spot simulation by a factor of 22. We could, and probably will, accelerate it much further, because data movement currently takes up about 51% of the CUDA time, and there's also an API overhead that's pretty high, so we will be looking into that more. The reason why we care about this is something that Nick alluded to before: we are interested in the inverse problem.
We want to start with a lot of simulated images... well, what we want to do is to tell, based on forward simulations like these, what the crystal parameters of this measured image here are. You can see some bright spots over here; it doesn't look anything like the simulation, but the idea is: let's iterate over different forward simulations so as to minimize the mismatch between the forward simulation and the measured data.
In fact, we can do this intelligently with quasi-Newton optimization. Essentially, the idea is that we use a forward simulation and the measured pixels to guide the next set of parameters for the forward simulation, iteratively decreasing the mismatch; the lowest-mismatch parameters are then our best estimate for the crystal parameters.
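A minimal sketch of that refinement loop, using SciPy's L-BFGS-B as a stand-in quasi-Newton optimizer; the forward model and the data here are toy placeholders, not diffBragg:

```python
# Quasi-Newton refinement sketch: pick the parameters that minimize the
# mismatch between a forward simulation and the measured pixels.
# The forward model below is a toy placeholder, not diffBragg.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
true_params = np.array([1.5, -0.7])
x = np.linspace(0.0, 4.0, 200)

def forward_sim(params):
    # Stand-in for the nanoBragg forward simulation.
    return params[0] * np.sin(x) + params[1] * x

measured = forward_sim(true_params) + 0.05 * rng.standard_normal(x.size)

def mismatch(params):
    residual = forward_sim(params) - measured
    return float(np.sum(residual**2))

result = minimize(mismatch, x0=np.zeros(2), method="L-BFGS-B")
print(result.x)  # lowest-mismatch parameters: the best estimate
```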
I also want to take a minute or two to point out an important aspect of these kinds of codes that is sometimes overlooked when we just want to accelerate kernels: the full software stack can matter. This is just a small example of what happens when we process a large batch of images stored on disk. Here we have time on the x-axis and the MPI rank on the y-axis, and red is bad here.
Red means I/O and data movement over the network. Over here we had a problem with MPI: you can see things bunch up and there's a lot of red, so that's very bad. Then over here we've reduced this I/O time by optimizing the way we actually schedule file access. But none of this is happening at the CUDA level.
So I just want to plug Jonathan Madsen's timemory utility here, because it allows us to build profilers that are able to profile across the Python, C++, and CUDA boundaries. A very quick, basic example: in Python, we just import this wall-clock object and we can surround the Python constructor for our nanoBragg simulator with it; and in C++ we can use timemory to decorate the same constructor, but here I've changed the label, "cpp" versus "py", so I've sandwiched the C++ profiler inside the Python profiler.
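A hedged sketch of the Python side; the timemory API names are as I recall them from its documentation and should be treated as assumptions, and `make_simulator` is a placeholder for the Boost.Python nanoBragg constructor:

```python
# Wrap a constructor call in a timemory wall-clock marker so the Python
# layer is profiled alongside the C++ layer. API names are assumptions
# from memory of timemory's docs; make_simulator is a placeholder.
import timemory
from timemory.util import marker

@marker(["wall_clock"])
def make_simulator():
    return object()  # placeholder for the Boost.Python nanoBragg object

sim = make_simulator()
timemory.finalize()  # flush the collected timing report
```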
And now, if you profile this whole thing, you can see what's happening here: we're calling nanoBragg in Python, that dispatches a call to C++, and the only thing that sits in between is the Python/C++ API.
When we then look at the differences in the time taken, you can see that in this example we can pick up the time spent in the Python API call. So I will essentially be expanding the use of timemory, not only to profile CUDA but to get a snapshot of the full software stack.

(Session chair: Johannes, you have 30 seconds left. Can you wind up?)

That's perfect, I am already done. I just wanted to say that all of this is work in progress.
Our nanoBragg CUDA port has already resulted in a decent speedup, but if we improve our data movement and our API based on what we are observing, we can definitely get way more out of it. Next on our plate is optimizing the diffBragg iteration itself, and then finally we want to look at OpenMP offloading, but, for example, there are some little issues associated with the fact that Python and our software stack like GCC, which has some problems playing nice with OpenMP
at the moment. All right, with that I'd like to thank you all for your attention, and I'd love to answer your questions.

(Session chair: Thank you very much, Johannes. So, can we have a couple of questions? The audience can ask questions in the Q&A box.)
On the GCC and OpenMP issue: currently we have a workaround strategy that is based on using Clang and then just hoping that the rest of Python doesn't break when you cross-link. But so far we've only used Summit and Cori GPU, and we'll be using Perlmutter; I'm going to deploy some of this on Tulip and Iris next, once we get the OpenMP issues figured out.

(Session chair: So are you confident you'll be able to replicate on OpenMP the performance you got from CUDA?)
I'm confident that we'll get something to work; I don't know whether I'll be happy with the performance.