Description
Felix Wittwer (NERSC)
Accelerating X-Ray Tracing for Exascale Systems using Kokkos
A
The motion of the atoms is effectively frozen, and so by varying the delay of the laser you can record stop-motion movies of chemical reactions: you can see how all the atoms are moving during the reaction. Now, the main problem with this is that you can hit each crystal only once. So to collect a full data set, we need a continuous stream of crystals, and all in all it takes something like 100,000 crystals for one data set. But the problem is that the crystals are shot into the beam.
A
So we have no control over where the beam hits the crystals, or how the crystals are oriented. But since we need to know the scattering of the X-rays from all directions, we need to collect as much data as possible, because in the end it's a little bit like collecting trading cards.
A
To get a complete set, you need to buy many more card packets than there are distinct cards, because you have no control over what the next card, or the next data shot, will be. And because there's only one such X-ray laser in the US, measurement time is scarce, and it's important to get results quickly and determine: is the collected data useful? Can we move on to the next sample? For this reason, these types of experiments require live feedback.
A
Currently, this instrument generates about 100 images per second, so for a hundred thousand images one data run takes about 15 minutes to collect. Live feedback therefore means we want to know whether this data is useful within something like 10 to 20 minutes, and to analyze these terabytes of data we need a supercomputer, most often Perlmutter at NERSC. But the big problem is that the schedule on the experimental side is totally independent of the schedule at NERSC. So when we get experimental time, Perlmutter might not be available, and then we need to use other sites, for example Frontier at Oak Ridge or Aurora at Argonne.
A
We would have to fragment our code, because each hardware vendor has their own programming model. To avoid this whole nightmare of maintaining three different codes, you need something a bit more abstract: an abstract programming model which can target all these different kinds of hardware. And there are a bunch of options, for example OpenACC, Kokkos, or OpenMP target.
A
Now, Kokkos is a growing C++ programming model, and the nice thing about it, compared to OpenMP or CUDA, is that it doesn't introduce any new syntax: you don't have any pragmas or triple angle brackets or anything like that. Its two central pillars are abstract execution and memory spaces. So instead of writing device kernels for the GPU, in Kokkos you tag a function for an abstract execution space, and only during compilation do you decide, or specify, where this execution space should live and where it should be calculated.
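To make "deciding at compile time" concrete: with Kokkos, the backend is typically selected by build options while the application source stays unchanged. The CMake flags below are the standard Kokkos backend switches; the build directory names are just placeholders.

```shell
# Same application source, three backends, chosen at configure time.
cmake -B build-cuda -DKokkos_ENABLE_CUDA=ON    # NVIDIA GPUs (e.g. Perlmutter)
cmake -B build-hip  -DKokkos_ENABLE_HIP=ON     # AMD GPUs (e.g. Frontier/Crusher)
cmake -B build-omp  -DKokkos_ENABLE_OPENMP=ON  # CPU threads for local testing
```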
A
The way these execution spaces appear in the code itself is via execution patterns. The classical case is that you have some for-loop which you want to parallelize. For example, for us, we want to simulate the scattering of the X-rays, so we want to calculate, for each pixel of the detector, how many X-ray photons will hit it on average. In the original C++ code we just had a for-loop which runs through all the pixels and calculates the scattering.

Now, the scattering at each pixel is independent of all the other pixels, so this can be trivially parallelized, and the idea behind Kokkos is that you have two things: the for-loop has a policy for how it should be parallelized, which in our case just runs through all the pixels, and then you have the body of the for-loop, which tells you what should be calculated. So going from plain C++ to Kokkos means just replacing the for-loop with the Kokkos parallel_for execution pattern; the body of the loop can stay the same. Apart from the parallel_for pattern, which runs all iterations independently, there is also parallel_reduce, where you combine all the different iterations into one result, for example if you want to calculate the sum of the squared entries of a list, and there is also a third pattern available, called parallel_scan, which runs multiple reductions.
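A minimal sketch of the transformation described above. Real Kokkos needs the library and a backend to compile, so this sketch uses tiny serial stand-ins with the same policy-plus-body shape as Kokkos::parallel_for and Kokkos::parallel_reduce (the genuine calls are shown in comments); the per-pixel model is a made-up placeholder, not the actual nanoBragg physics.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in with the same shape as Kokkos::parallel_for: a range policy
// (iteration count) plus a loop body. The genuine call would be
//   Kokkos::parallel_for("scatter", n_pixels,
//                        KOKKOS_LAMBDA(const int i) { image(i) = ...; });
// and only the backend chosen at compile time decides where the body runs.
template <typename Body>
void parallel_for(std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);  // serial here
}

// Stand-in for Kokkos::parallel_reduce: each iteration contributes to one
// accumulator. The genuine call would be
//   Kokkos::parallel_reduce("sumsq", n,
//       KOKKOS_LAMBDA(const int i, double& acc) { acc += v(i) * v(i); },
//       result);
template <typename Body>
double parallel_reduce(std::size_t n, Body body) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) body(i, acc);
    return acc;
}

// Hypothetical per-pixel model; each pixel is independent of the others,
// which is why the loop parallelizes trivially.
double pixel_intensity(std::size_t i) {
    return std::exp(-0.001 * static_cast<double>(i));
}

// The detector simulation: the loop body is identical whether it is driven
// by a plain for-loop or handed to parallel_for.
std::vector<double> simulate_detector(std::size_t n_pixels) {
    std::vector<double> image(n_pixels, 0.0);
    parallel_for(n_pixels, [&](std::size_t i) { image[i] = pixel_intensity(i); });
    return image;
}

// Sum of squared entries, the parallel_reduce example from the talk.
double sum_of_squares(const std::vector<double>& v) {
    return parallel_reduce(v.size(),
                           [&](std::size_t i, double& acc) { acc += v[i] * v[i]; });
}
```

Porting then amounts to swapping the loop driver while the body survives unchanged, which matches the search-and-replace experience described later in the talk.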
A
The second pillar is the memory management, because GPUs have their own memory that is separate from the system memory. For any calculation you want to do, you always need to transfer the data from system memory to GPU memory, and after the calculation is finished you need to transfer the data back into system memory. A lot of CUDA code is just taking care of this memory management: for example, if you want to create an array of zeros in CUDA, you first need to create a pointer, then you need to allocate it, and so on.
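Kokkos wraps this allocation-and-transfer boilerplate in its View abstraction. Below is a toy host-only stand-in that mimics just one property of Kokkos::View, allocation plus zero initialization in a single step; the real Kokkos and raw CUDA counterparts are shown in comments, and the struct name is ours, not a Kokkos type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In raw CUDA, an array of zeros on the device takes several steps:
//   double* d;
//   cudaMalloc(&d, n * sizeof(double));
//   cudaMemset(d, 0, n * sizeof(double));
//   ... run kernel ...
//   cudaMemcpy(host, d, n * sizeof(double), cudaMemcpyDeviceToHost);
// Kokkos collapses allocation into one line and makes transfers explicit:
//   Kokkos::View<double*> a("a", n);         // zero-initialized, lives in the
//                                            // memory space chosen at compile time
//   auto h = Kokkos::create_mirror_view(a);  // matching host-side buffer
//   Kokkos::deep_copy(h, a);                 // device-to-host transfer
//
// Toy stand-in: host memory plays the role of "the" memory space.
struct DeviceArray {
    std::vector<double> data;  // allocated and zero-filled in one step
    explicit DeviceArray(std::size_t n) : data(n, 0.0) {}
    double& operator()(std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }
};
```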
A
We used a small test program called nanoBragg; nanoBragg simulates the diffraction images at the pixel level. This is a massively parallel problem which is well suited for GPUs, because they are more or less designed for exactly this, calculating images, originally for video games. The original code was written in C++; already some years ago it was ported to CUDA and we ran it on NVIDIA GPUs, and the CUDA port resulted in about a 20x speedup. But now, for Perlmutter and in preparation for Frontier, we ported the code to Kokkos.
A
This took us a couple of weeks, with some pitfalls in using Kokkos and getting to know Kokkos, but mostly it was just search and replace, and making sure that we didn't introduce any errors when we replaced the CUDA constructs with the Kokkos constructs. The big question now is, of course: how did this affect the performance? For us, the standard test benchmark is to simulate 100,000 images, and we tested this running on 128 nodes of Perlmutter. The original CUDA code ran in about two and a half minutes, and by switching to Kokkos we, surprisingly enough, got even better performance of just a bit over two minutes. It turns out that the original code used a lot of registers.
A
So we couldn't fully occupy the GPU, and the Kokkos version used just enough fewer registers that we could occupy the GPU more and thus achieve faster calculations. Concerning portability, we ran the same code on Crusher, the Frontier test-bed system at Oak Ridge, which has AMD MI250X GPUs. The same code, just changing the compiler flags as I mentioned before, ran there in 54 seconds. You can't directly compare the numbers between Perlmutter and Crusher, because the nodes are slightly different, but in general we achieved pretty good performance on both systems with the same code. So I would say going to Kokkos was the right way to go, because we need to be able to use different systems.
A
The porting itself was relatively straightforward. There were some peculiarities, which were not really Kokkos's fault; they were more like compiler bugs, or let's say unintended behavior from the compiler. The nice thing is that it's pure C++, so you don't need any fancy new syntax, and you also don't have to worry about syntax highlighting, because it's all C++. One slight problem with Kokkos is that, as it tries to support everything, it sort of has to settle for the smallest common denominator. So, for example, CUDA library support is limited, because it needs to ensure that it runs on all the systems.
A
So if you are, for example, relying on cuFFT, then you would have to implement this on your own, which can be done, but Kokkos can only help you slightly there. On the other hand, even by just staying on NVIDIA we could gain some performance, because the pure CUDA programming inside Kokkos was done by the Kokkos development team and not by us, and they are probably much more capable of doing this. And switching to different hardware was also verified to work.
B
Thank you, Felix, for a very interesting talk on using GPUs and Kokkos for data analysis. We would like to remind the users that tomorrow and the day after there will be more talks on GPUs for data, so please feel free to tune in for those.
B
If there are any questions, please put them in the chat. If not... actually, I do have a question. I was interested to note that the Kokkos performance was faster than the CUDA performance. Is that because maybe you could have written the CUDA a little bit better, and the CUDA would then win? Because I would naively think that Kokkos would be almost as good as, but never better than, CUDA.
A
So I would say you can probably write CUDA code that is as fast as Kokkos code. The question is: is your average CUDA code as good? We have done some profiling and looked into this, and here is what we discovered with this method.
A
So
we
have
a
bunch
of
different
corners
and
we
noticed
that
the
performance
increase,
but
the
difference
between
cooler
like
in
every
single
kernel
caucus
was
faster
even
for
some
simple
kernels,
which
were
just
more
or
less
a
vector
at
and
the
other
right
here
on
the
right.
So
I
don't
know
like
it's,
it's
probably
just
some
minor
setting
or
something
that
if
you
know
Cuda
you
do
this,
but
for
the
average
user.
You
are
not
aware
of
this
or
something
that's.