From YouTube: Intro to GPU: 04 NVidia Software Stack, Part 2
All right, let's continue with the session. Where we left off, I had described two of the three approaches for doing accelerated computing: we had the libraries, and we had the directives. Libraries are the most straightforward and easy to use. They offer high-quality implementations of common operations, but they are not that flexible; they have a set number of relatively well-defined things they can do, mostly mathematical operations. Directives allow you to target particular loops in your application and give the compiler a way to expose that parallelism to the hardware.
So you don't have to know anything about the specific low-level programming languages or models that are used for the GPU — although, of course, having some familiarity with them will benefit you, I think. So even if you decided to use OpenACC as your programming model, it would still, I think, behoove you to learn CUDA — CUDA C or CUDA Fortran, for example — because in doing so you would learn a little bit more about the architecture and how work is mapped to it. So it's certainly a benefit to you to do so.
But you're not required to do so for OpenACC, and you can still get a reasonably high-performance implementation. Programming languages are for maximum flexibility: you have some workload that either is not easily expressed as the sort of loop that can be parallelized well by a compiler, or you want to control how that parallelism is mapped to the hardware, because you think you can do a particularly good job for that work. Now, this is not a trivial thing to do. It is very possible to get this wrong.
I would say that's because GPUs are complicated. In fact, any modern processor is complicated, but GPUs are complicated, and performance on GPUs is complicated. So do not assume that on day one you will be able to write code that achieves the seven teraflops per second the GPU potentially exposes. But it's also not hard, I would say: CUDA is not, for example, a very hard language to learn; it just requires some time and practice.
So this is an overview of the different programming languages that are available for working on GPUs. Of course, we have bindings in the standard HPC languages — C, C++, and Fortran — and those are actually exposed in a couple of different ways. CUDA is NVIDIA's umbrella framework, or architecture, for doing GPU computing, and CUDA directly has bindings included in C, C++, and Fortran.
If those are the sorts of things that you prefer to use. So here's an example in C of the CUDA implementation — of how you'd parallelize some work on the GPU. On the left, we have our serial implementation; again we're coming back to this saxpy operation. What we have is our interface, where we're given vectors x and y, and we have the length of the arrays and also the scaling factor for the a·x plus y, in serial C code.
What's different in a language like this is that you have to identify how to map the parallelism on the GPU to the work. Whereas in an approach like OpenACC you can put a directive on top of this loop and the compiler will figure out how to do that, in this code you have to explicitly identify which thread on the device handles which index in the loop.
We are now in control of that, rather than the compiler doing it for us. The other bit of different syntax is that we now have to use this syntax here, with triple chevrons, to launch work on the GPU. I won't go into too many details about what this means, but essentially, if you multiply these two numbers together, that says how many threads are spawned on the GPU to do parallel work. So you can see that
this is actually quite a large number of threads, and that's very typical for GPU programming: you're spawning a very large number of threads — ideally hundreds of thousands, or something in that ballpark — to do parallel work. The other piece of the change that we make is that we add this keyword, the attribute global, which is the CUDA syntax for saying that this is a function that can be launched to do work on the GPU, and then inside of it we can expose the work to parallel threads.
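As a rough, minimal sketch of the pattern being described here — not the exact code on the slide — a CUDA C saxpy kernel and its triple-chevron launch might look like this (d_x and d_y are assumed to be device pointers allocated and filled beforehand):

```
// __global__ marks a function that can be launched to do work on the GPU.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    // Each thread computes its own global index from its block and thread IDs
    // and handles exactly one element of the loop.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Launching the kernel: the product of the two numbers in the triple
// chevrons is the total number of threads spawned on the GPU.
// int threadsPerBlock = 256;
// int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// saxpy<<<numBlocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
```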
Probably the most popular example today is that you can define lambda functions in your C++ code that can then be captured and run on GPUs, and this is the fundamental piece of technology underlying, for example, Kokkos and RAJA, which are two of the big performance-portability layers being developed by DOE. So the idea is that you identify a chunk of work — a loop iteration — with an index.
You capture that in a lambda, and then the underlying library — the performance-portability model — does all the work of distributing that work across parallel threads (a rough sketch of this pattern is below).
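As a hedged illustration of what such a layer builds on — not Kokkos or RAJA themselves, just the underlying CUDA mechanism — a generic kernel can accept a lambda describing one iteration's worth of work. The kernel name for_each_index is made up for this sketch, and compiling device lambdas requires nvcc's --extended-lambda flag:

```
#include <cuda_runtime.h>
#include <cstdio>

// A generic kernel that applies any callable (e.g. a lambda) to each index.
template <typename F>
__global__ void for_each_index(int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

int main() {
    int n = 1 << 20;
    float a = 2.0f;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // The work for one loop iteration is captured in a lambda; the launch
    // machinery decides how it is spread across parallel threads.
    for_each_index<<<(n + 255) / 256, 256>>>(n, [=] __device__ (int i) {
        y[i] = a * x[i] + y[i];
    });
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```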
So that's kind of the same effect you would get from using something like OpenACC or OpenMP, where somebody else figures out how to map the work to the hardware. Your job is to tell the interface how much work you have to do and then, for each element of the work, what it is that you want to do. And so this is an example of templated C++ code — a functor, or a class — where you could do similar types of operations. This is very expressive, although it's not fully featured: there are certainly some C++ things you cannot do, and because of the way CUDA is architected there are limitations on this, but a lot of modern C++ can be done in CUDA C++. One question?
Yes?
I can give a couple of different answers to that question. The practical answer — the political or practical answer — to that question might be that OpenACC and Kokkos are developed by different entities that have different goals. OpenACC is primarily vendor-supported functionality: NVIDIA/PGI implements OpenACC, Cray has implemented OpenACC in the past, and GCC now implements OpenACC. So that's really a community- and vendor-supported thing that is quite generally expressive and targets specific loops.
But it's not necessarily the right approach for complicated pieces of work, or ones that use, for example, advanced C++ features, although you can do C++ in these approaches. Kokkos, as an example, is developed within DOE — really developed in particular by Sandia, primarily, although they have collaborators at plenty of other labs, including Oak Ridge — and they are targeting problems of interest to DOE. So, for example, their fundamental workload, the thing that they care most about, is unstructured mesh problems.
That's really where it got started — for example, in the Trilinos library at Sandia — and then they built up from that. And so they have pieces of work — for example, multi-dimensional loops — that are really targeted to work well for typical DOE problems of interest, and that means they do some things very well, and some things are simply not part of what they try to achieve. The other thing I would say is that this gives DOE some control, right?
It gives the community control over how this works, and Kokkos is implemented on many major backends; they work closely with the vendors to do it. So I would say both have their strengths, and I'm not going to say one is obviously better than the other, but you might consider using OpenACC for very simple for loops — simple C or simple Fortran loops where you want maximum ease of implementation. Kokkos is much more high-powered.
It gives you really advanced controls over things like memory layouts and how the work is implemented, but it requires more work: it is complicated C++ code and takes some training to figure out, so there is a trade-off there. The one other thing that I would say is different is that Kokkos has built an ecosystem around it. For example, they have a product called Kokkos Kernels, which is basically an implementation of various traditional high-performance-computing math operations — matrix multiplies, for example — that can be run portably on many architectures.
You could have that one interface that can then run in multiple places. Whereas if you're using, say, BLAS on some system, often you can just link against a different library and get it working, with cuBLAS, for example, you have to actually make some changes to your implementation.
You have to write some code differently. The promise of something like Kokkos Kernels is one interface where they take control over all of the work, dispatching it to different backends. So they're both solving similar problems, but in different ways, and targeting slightly different audiences, and I'd be happy to have a conversation with any of you offline if you're curious which approach makes the most sense for you and your code.
CUDA also has an implementation — an API exposure — in Fortran, and it works in a kind of similar way: the idea is that you mark up a subroutine with a global attribute, which says this is work that can be launched on the GPU. You can pick out which thread you are and then do a piece of work on that thread. You launch the work using the CUDA syntax with the triple chevrons, which tells it how many threads to launch, and the interface is also very Fortran-like.
So you can add an attribute to an array — for example, device in this case — which basically says: allocate this on the device. So it is very Fortran-like syntax. This was developed originally by PGI, and there is one other implementation of it, by the IBM XL compiler, so if you're using Summit, for example, this is available to you there too.
So these are all available. Oh, I skipped a slide — sorry, that was while I was answering the Kokkos question. What I wanted to say is that I mentioned there were limitations in the C++ approach, and a big limitation is that a lot of STL-type objects are not available on GPUs. For example, there is no implementation of std::vector on GPUs; that does not exist.
It's a very complicated piece of code and it does not map well to GPU parallelism, so you cannot write CUDA C device code and then use std::vector inside it. That is a limitation that we have to come up with clever ways to deal with. One of the approaches for this is Thrust. Thrust is a library developed by NVIDIA which allows you to write STL-like algorithms on GPUs, and it is typically a host-centric approach.
So you do something like this in your CPU code: you create a vector — there are APIs for host_vector, which means on the CPU, and device_vector, which means on the GPU — so this creates and allocates memory in both places. We can then do standard, STL-like operations by getting iterators to the beginning and end of a list or a vector, and then we can do operations like sorting the list or copying arrays, that sort of thing (a rough sketch is below).
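As a hedged sketch of this host-centric Thrust style — the saxpy-style transform is an assumed example, and the device lambda again needs nvcc's --extended-lambda flag:

```
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

int main() {
    const int n = 1 << 20;
    const float a = 2.0f;

    // Host-side (CPU) storage.
    thrust::host_vector<float> h_x(n, 1.0f);
    thrust::host_vector<float> h_y(n, 2.0f);

    // Assigning to a device_vector allocates GPU memory and copies the data.
    thrust::device_vector<float> d_x = h_x;
    thrust::device_vector<float> d_y = h_y;

    // STL-style algorithm over iterator ranges: y = a*x + y, run on the GPU.
    thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(),
                      [=] __device__ (float x, float y) { return a * x + y; });

    // Copy the result back to the host.
    h_y = d_y;
    return 0;
}
```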
So often this is the way to go if you have a very STL-heavy code that you just want to get started on GPUs. In some cases this will not solve every problem, and I have seen plenty of cases in scientific computing codes where this ended up not being a great approach, and they really had to rework it to look more C-like. So I can't make promises there.
So here are six different programming models that help you do this. You've seen most of these already, but I think this helps crystallize what the different approaches available to you are. Again, I've already described saxpy — single-precision a·x plus y — and this is again just: how do we do this on GPUs in different ways? So, with OpenACC,
this is notional code — it's not our fully working example — but it shows you what you need to do: you take your serial C code, your saxpy fragment, and put #pragma acc kernels on top of the loop, or in Fortran you use !$acc kernels and !$acc end kernels, and then the compiler figures out how to do the work for you. That's version one.
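A minimal sketch of that directive version in C, assuming an OpenACC compiler such as nvc with -acc (the data clauses are added here because the arrays come in as pointers):

```
// The directive asks the compiler to generate GPU code for the loop;
// how the iterations map to threads is the compiler's decision.
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```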
The second option was cuBLAS. This is the library approach: you basically create vectors through cuBLAS, allocate the memory, copy the data from the CPU to the GPU, do the work on the GPU, and then copy the data back. So again, the same result — we started on the CPU, allocated memory, copied to the GPU, did the work, and copied it back — but using a library.
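A rough sketch of that library sequence with cuBLAS — error checking omitted; this illustrates the allocate/copy/compute/copy-back pattern, not the exact slide code:

```
#include <cuda_runtime.h>
#include <cublas_v2.h>

void saxpy_cublas(int n, float a, const float* x, float* y) {
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));   // allocate on the GPU
    cudaMalloc(&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Copy input vectors host -> device.
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    // y = a*x + y, computed on the GPU by the library.
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

    // Copy the result device -> host.
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
}
```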
Now, in CUDA C, instead of the directive-based programming model, we are explicitly targeting the parallelism to the work: we take responsibility, and we say this thread does this piece of work. And that's probably not something that's particularly familiar to you
if you're used to OpenMP for CPU threading. With OpenMP, as with OpenACC, the idea is that the compiler does that work for you; you don't have to explicitly map the work. So CUDA requires a little bit more buy-in from you as the programmer, as a developer, but it also gives you much more flexibility:
you can say exactly what piece of work you want to do for each thread. And although I think there's probably, you know, community lore out there that CUDA is hard, or that you shouldn't use CUDA, I don't mean to imply that. I think that CUDA is very straightforward to learn — especially modern CUDA; be wary of Stack Overflow posts from 2010, because the language has evolved quite a bit since then — and so I think this is actually pretty straightforward. But also keep in mind that performance portability may matter to you.
That may force you to make some choices. I wouldn't say that CUDA is not portable; I think you can see this in the case of Frontier, where AMD's HIP implementation looks a lot like CUDA, and so one of the things that I recommend is that if you want to prepare for Frontier, do well on Summit, because the architectures look very similar. So CUDA is an example where there is some convergence, I think, in how these programming models are being exposed by different vendors.
So this is the Thrust approach that I described: you create a host vector and a device vector, and you can do some operation — in this case we're using thrust::transform — to loop from the beginning to the end of the arrays and then do some operation, like two times the first one plus the second one.
These are the types of operations that you can do with this STL-like approach. CUDA Fortran, as I said, is very similar to CUDA C but Fortran-like: you again identify which threads you want to do which work, you launch the kernel using this syntax, and then you can do very nice things that are Fortran-like, like basically setting an array to a value; and if this is an array that's on the GPU, all of the work is done
under the hood of figuring out how to actually do that sort of copy, so the compiler knows how to interpret that operation. So you can do array assignments like you would in standard Fortran. And then, in Python: this is the first Python example I've given, where I use Numba. Numba is a library for doing parallelization of loops or, more commonly, universal-function-type approaches in Python.
A very Pythonic way to write a saxpy operation — or one way to write it, maybe not the most Pythonic way — is that you define your function where x and y are NumPy arrays, and then you just do an implied loop over the elements: you return a times x plus y, and you call that in Python. Numba's bread and butter is universal functions.
So this decorator, @vectorize, which comes from the Numba package, says: I want a vectorized implementation of this code, and then all the work of figuring out how to parallelize or vectorize it is done by the Python runtime — by the Numba runtime. The syntax is that you have to give it the output type that it returns and then the inputs, so for these objects you have to specify the data types. That's important, because GPUs don't handle arbitrary data types well; as I mentioned, they're primarily good at numerical data.
We can all find easy performance bottlenecks that help the CPU code as well as the GPU code. Of course, I always get a little bit sad when I do that, because then the GPU doesn't look as good — you know, it's hard for me to get paid — but, more seriously, that's a big benefit: it gives you a fresh eye on your code.
It was not at all where they expected, and so these types of operations — this workflow of profiling a code, finding the bottleneck, and then making it faster — is very generic, and I think one nice thing about the change to accelerated computing is that it forces you to do that: it forces you to take a fresh eye to your code, thinking very carefully about where the parallelism is in your code and then how to target your work to that parallelism. It's not all roses.
For something simple, you know, it's hard to get that wrong, but there are more complicated algorithms where the most efficient way to write it on GPUs may not be the same as the most efficient way to write it on CPUs, and, you know, compiler directives can't solve that problem for you, because it may be implicit in the algorithm that you wrote, or in the way that you wrote the thing you're trying to do. So I don't want to pretend that there's nothing to do here, but at the same time, this is also why it's fun to be
a scientific computing developer: you get an opportunity to think about how to apply your work and take advantage of modern architectures. So the story that I want to tell you is that it's easy to get onto GPUs and, with a little bit of care, do very well on them. It takes more effort to get that peak performance, and sometimes it's not even achievable for your algorithm, but good performance is.
So IBM has an implementation, and Clang has an implementation, of OpenMP offload that is not written by NVIDIA but nevertheless takes advantage of the fact that our tools and our APIs can be used to implement those things on the GPUs. So that's basically what I wanted to describe: this is a platform that is continuing to expand and get better. And it occurs to me that I've never actually introduced myself.
So, sorry about that — better late than never. I'm Max Katz, and the reason I'm talking to you is, of course, that my job is as an NVIDIA Solutions Architect, so I work with developers to help them understand the NVIDIA platform and become more successful. In particular, I'm the Solutions Architect for the Department of Energy, and so my job is really to work closely with you, at labs like yours, to make your work successful, and so you should absolutely feel free to reach out to me.
But what he's asking about is that there is a hierarchy of parallelism on GPUs, and I think this is true on every GPU platform — for example, this is very publicly documented for AMD's GPUs — and the reason that there's a hierarchy of parallelism is that modern chips are very hard to make. It is hard to make a monolithic chip with 5,000 threads; that is a very hard thing to do.
You get very low yield as a processor manufacturer, and so the way that we GPU implementers do it is that we make smaller units — which for NVIDIA are called streaming multiprocessors — that we can then tile across a die. Essentially, we take one unit, one fundamental compute unit — the SM, or streaming multiprocessor — and put a bunch of them on the GPU, and then they can coordinate to do parallel work. And the number of SMs, or streaming multiprocessors,
on the device essentially determines the compute power of the device. So our lower-end GPUs, like the gaming GPUs, typically have fewer of them — they have less total compute power, less raw compute power — and the bigger GPUs have more of these multiprocessors; it is essentially the same architecture, but with fewer or more of them, and that determines how much compute capability is available to you.
So having this hierarchy of parallelism makes it easier to build a chip that has a massive amount of parallelism, and it plays into the memory structure too: each multiprocessor has an L1 cache that is independent of the L1 cache of the other streaming multiprocessors, and so threads can communicate with each other on that multiprocessor but cannot directly communicate with threads on the other parts of the chip. And so, in this syntax
syntax.
A
That
we
saw
before
in
CUDA
here
to
see
we
have
to
do
is
specify.
How
is
our
presence
tributed
across
those
two
levels
of
parallelism?
So
the
first
number
in
the
in
the
triple
chevron?
Syntax
is
the
number
of
teams,
or
groups
or
included,
speak
thread
blocks.
So
this
is
this:
has
teams
of
threads
or
groups
of
threads
that
do
work
in
concert
and
those
threads
are
targeted?
They
live
on
a
particular
multiprocessor,
a
particular
SM,
and
then
the
second
number
is
how
many
threads
heard
that
for
that
team
or
a
group
right.
This concept is also exposed in OpenACC, in the concept of gangs, and in OpenMP, in the concept of teams: there are groups of threads, and then that second number is the number of threads in each team or group — in CUDA speak, a thread block — and the total number of threads that you can run in a group is 1024 on an NVIDIA GPU. That's the limit at that lower level of parallelism.
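As a small, assumed sketch of that two-level launch configuration, reusing the hypothetical saxpy kernel from the earlier sketch:

```
// Map n elements of work onto the two-level hierarchy: a grid of thread
// blocks (teams), each with up to 1024 threads on an NVIDIA GPU.
int n = 1 << 20;
int threadsPerBlock = 256;                                    // threads per team/block
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // number of teams/blocks

// First number: how many blocks (distributed across the SMs);
// second number: how many threads within each block.
saxpy<<<numBlocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
```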
And then, if you want massive parallelism, you have to combine many groups of threads together; that's what the other number — the number of groups — gives you. So it requires a little bit more work for you to understand how to map your work to that two-level hierarchy, and that's one of the nice things about OpenACC: the compiler makes an informed choice, a heuristic guess, about how to map that work effectively.
So you don't have to do that, but you will probably, as a performance optimization, want to do it at some point, because you will get much better performance if your parallelism is mapped well to the architecture, and that is again a true statement across GPU implementations. It is not really possible, for the most part, to build a good GPU at acceptable yield any other way — well, Cerebras is an example of a vendor that has built one monolithic chip — but for most modern GPU implementations...