From YouTube: Using GPUs as Accelerators
Description
Max Katz from NVIDIA presents a talk on Using GPUs as Accelerators. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Oisín Creaner
A: In the concept of accelerated computing, you have heterogeneous nodes, which are combinations of CPUs and GPUs. Jay showed you what this might look like in the context of Perlmutter: you have a CPU that is very well optimized for serial tasks, like input and output and other calculations that are not parallel in nature, or at least have limited parallelism, whereas GPUs are optimized for highly parallel tasks, and you combine these two processors in order to solve your problem effectively. That's really the new norm that we're in, where you need to use both of these things together to solve problems in science.

A: CPUs are latency-reducing architectures, and what we mean by that is that they are optimized for serial tasks. They have very large amounts of memory, very high clock speeds, and very large caches, which help them to reduce latency. So when you're trying to access some data from memory, for example, CPUs are very good at ensuring that the time it takes to access that memory is as small as possible. But they're not good at everything: they have relatively low memory bandwidth, and if the CPU gets it wrong, if the data that you were trying to access is not immediately available, then that is costly. In terms of energy efficiency, they are relatively low efficiency, if you think about it in terms of the number of floating point operations per watt, say. Conversely, GPUs exist to hide latency, and so they have very high bandwidth memory, but low capacity memory.
A: On the other hand, GPUs have a very high amount of compute resources, and their memory accesses are high latency. When one thread accesses a particular item of data from memory, that takes a long time, but the way we help avoid that problem is by having many threads. While one thread is trying to do something, and it takes some time to fulfill that request, we have many other threads doing work in order to hide the latency of that request. So the individual performance of any one thread is relatively low compared to a CPU thread; however, by combining many threads in parallel, we can solve problems effectively. In terms of energy efficiency, GPUs are relatively high efficiency, which goes hand in hand with the fact that the individual cores that make up a GPU are relatively streamlined. They don't do a whole lot, but the things that they do, they do well when combined in parallel.
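(As a minimal sketch of that latency-hiding model, assuming CUDA: every element of the problem gets its own thread, so while some threads wait on memory the scheduler runs others. The kernel and sizes here are illustrative, not from the talk.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one independent element; with millions of threads
// in flight, memory latency on any one of them is hidden by the others.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // one work item per thread
}

int main() {
    const int n = 1 << 24;  // ~16M elements: enough work to hide latency
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // tens of thousands of blocks
    saxpy<<<blocks, threads>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```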
A: So GPU acceleration is the task of taking your application code, identifying the parts of it that can run on a GPU, putting those parts on the GPU, and leaving the rest on the CPU. In terms of, say, the number of lines of code, you might put a relatively small fraction on the GPU. It really depends on what your application looks like, but it might be just a few percent of the lines of code that dominate the runtime of your application.
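(The speaker doesn't spell this out, but the arithmetic behind offloading a small, hot fraction of the code is Amdahl's law. As a worked example with assumed numbers: if the offloaded part accounts for 95% of the runtime and the GPU speeds it up 10x, the overall speedup is

$$S = \frac{1}{(1 - 0.95) + 0.95/10} \approx 6.9,$$

so a few percent of the lines can deliver most of the win, while the remaining serial 5% caps what is achievable.)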
A: I want to tell you a little bit about the GPU architecture, because I think that will help make sense of this claim I'm making about GPUs being massively parallel devices, but also ones that really require massive parallelism to succeed. Jay showed you an image of what the chip looks like from the outside; this is kind of an inside view of the latest NVIDIA GPU, the A100. We announced this a couple of months ago, and this will be the GPU powering the GPU nodes on Perlmutter.
A: If you pay attention to one thing from this slide, it's the number of what we call CUDA cores; I'll explain what this term means in a moment. But you can think about it in very rough terms as the amount of compute power that the GPU has, and you probably know that the number of cores a processor has is usually proportional to its compute power in some sense. If you think about the modern CPUs that you're familiar with, the ones that might be running in your laptop or desktop, they typically have a few cores, maybe tens of cores in the higher-end server nodes.
A: That's certainly true for the high-end CPU nodes that we have running today; they run something like 10 to 100 cores. By contrast, GPUs have thousands of cores, and that allows them to solve problems massively in parallel. But the catch, and I'll show you this in a moment, is that those cores don't solve every problem.
A: Problems that aren't massively parallel aren't really optimized for running on GPUs. GPUs also have a relatively small amount of memory compared to CPUs, and so this GPU has 40 gigabytes of memory. But the GPU memory bandwidth is much higher than a typical CPU's: the typical high-end CPUs that you see today have something like 100 to 200 gigabytes per second of memory bandwidth, whereas modern GPUs using high bandwidth memory (HBM) can achieve something like a terabyte per second or more, and this trend is true across all GPU vendors.
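(If you want to see that bandwidth gap yourself, here is a minimal sketch, assuming CUDA, that times a streaming copy kernel with CUDA events and reports the effective memory bandwidth; the size and launch configuration are illustrative.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Streaming copy: one read and one write per element.
__global__ void copy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 26;                       // 64M floats = 256 MB
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));          // contents don't matter
    cudaMalloc(&out, n * sizeof(float));         // for a bandwidth test

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gb = 2.0 * n * sizeof(float) / 1e9;   // read + write traffic
    printf("effective bandwidth: %.0f GB/s\n", gb / (ms / 1e3));

    cudaFree(in); cudaFree(out);
    return 0;
}
```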
A: If you look closely, you can see the die is composed of several little chunks, which have a little bit of green and blue in them. In total there are 128 of these on the die, and we take 108 of them and turn them into what we call the A100 GPU.
A: These chunks are the SMs, or streaming multiprocessors, of the GPU. A larger GPU in most cases has more SMs, and a smaller GPU has fewer SMs; the high-end GPUs today have something like 100 of these SMs. These are the individual multiprocessors that are tiled across a die to make a GPU. Inside one of them, you have exactly 32 double precision (FP64) units and 64 single precision (FP32) units. So it might be best to think of one of these SMs as the equivalent of a CPU core that you're familiar with, and when we say a term like CUDA core, what we really mean is something more like an arithmetic logic unit or a floating point unit.
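(For context, a back-of-the-envelope check with those per-SM numbers, using the A100's published boost clock of about 1.41 GHz; this arithmetic is mine, not from the slide:

$$108 \ \text{SMs} \times 64 \ \text{FP32 units} = 6912 \ \text{CUDA cores},$$

$$108 \times 32 \ \text{FP64 units} \times 2 \ \tfrac{\text{FLOP}}{\text{FMA}} \times 1.41\ \text{GHz} \approx 9.7 \ \text{FP64 TFLOP/s}.)$$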
A: So another way to think about GPUs is that they have a very large number of floating point and integer units, which makes them very good at floating point and integer math, but not so good at other problems that don't involve math. They additionally have tensor cores, which are well optimized for solving matrix multiplication problems, and they have a relatively small amount of cache memory: while CPUs may have something on the order of hundreds of megabytes of cache, you typically have much smaller amounts on GPUs.
A: On the other hand, any one of these individual SMs, or what you might call cores, can have up to 2048 threads running on it at one time. So if you take the 108 SMs that make up an A100 GPU and multiply that by the 2048 threads that can be active on each of them at one time, you could have over 200,000 threads running on the GPU at one time, and that's really what these GPUs are capable of.
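(You can reproduce that arithmetic on whatever GPU you have with a short CUDA runtime query; a minimal sketch, nothing here is specific to the talk.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the device and reproduce the speaker's arithmetic:
// resident threads = SM count x max threads per SM.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("%s: %d SMs x %d threads/SM = %d resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, resident);
    // On an A100: 108 x 2048 = 221,184 threads, the ">200,000" in the talk.
    return 0;
}
```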
A: So my takeaway for how to use GPUs effectively is that you have to expose massive parallelism. Again, you have to be solving problems that are relatively mathematical in nature, dominated by floating point or integer arithmetic, and you have to be able to hide the latency of any individual floating point or integer operation by combining the results of many, many cores. I've given you that number below: more than 200,000 threads can be active on a GPU at one time. This requires a qualitatively different level of parallelism to be exposed than you would typically find on CPU architectures. Even some of the more modern CPU architectures, like the KNL CPUs on Cori, didn't require this level of parallelism to be successful, but you have to use hundreds of thousands of threads to be successful on a modern GPU. What that means is that your problem needs to have at least that many degrees of freedom. So when you think about the size of your problem, whether it's the number of elements in your grid if you're doing some sort of grid calculation or something like that, you need to have something like 100,000 or a million degrees of freedom.
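(To make that concrete with an assumed example: a modest 3D grid of $128^3$ cells already has $128^3 \approx 2.1 \times 10^6$ degrees of freedom, an order of magnitude more than the roughly $2 \times 10^5$ threads the GPU can keep resident, so mapping one thread per cell comfortably saturates the device.)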
A: If you take one of these SMs, which are kind of the building blocks of a GPU, and lay them out horizontally like this, they have different memory levels in their hierarchy. They have registers, which are the actual source and destination of the computation on the GPU. These live on the actual chip, and they are what the floating point units and integer units use to do their math, so they are the highest bandwidth and lowest latency memory.
A: You also have on these SMs, or individual cores if you like to think about them that way, an L1 cache, which is the closest cache to the chip; then you have a device-wide L2 cache; and then the main global memory. If you lay those out from closest to the chip, and therefore highest in bandwidth but also smallest in capacity: registers have only a few tens of kilobytes per SM, or core, but can be accessed at much greater than a terabyte per second of bandwidth.
A: A similar thing is true for the L1 cache, which lives on each one of these SMs, or cores if you want to think about it that way: again, something like 100 kilobytes per SM, but it can be accessed at more than 10 terabytes per second. The L2 cache has a size of 40 megabytes on an A100 GPU.
A: It can be accessed at something like five terabytes per second, and then the main RAM has 40 gigabytes, so you can see it's orders of magnitude larger, but it's also slower to access. Of course, as is true for CPUs, you want to make as much use as you can of the memory that's closest to the chip.
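(One common way to exploit that fast on-chip storage is CUDA shared memory, which on recent NVIDIA GPUs is carved out of the same on-chip storage as the L1 cache. A minimal sketch, assuming CUDA; the reduction here is illustrative.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block stages its slice of the input in on-chip shared memory and
// reduces it there, so global memory is touched only once per element.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];            // fast on-chip storage
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // in-SM tree reduction
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %f (expect %d)\n", total, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```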
A: I'll just leave a couple of tips here that you can look at later, but the main takeaway that I want to emphasize is that, just like you won't be able to saturate the compute core performance without having many threads, you also won't be able to saturate the memory bandwidth without many threads in flight.
A: So then, in summary, your main priority is to make the code faster, and that doesn't always mean making the part that's on the GPU run faster. Sometimes it means optimizing data transfer back and forth to the GPU, because the CPU and GPU have their own separate memory spaces, and sometimes it just means refactoring your CPU code to expose parallelism; a lot of codes that exist today, before they start porting to GPUs, don't have their parallelism exposed.
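(A sketch of the transfer point, assuming CUDA and a hypothetical iterative update kernel: the data makes one round trip for the whole run instead of two copies per iteration, which is often the difference between a slowdown and a speedup.)

```cuda
#include <cuda_runtime.h>

// Hypothetical per-element update applied many times.
__global__ void step(float* u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] = 0.5f * u[i] + 1.0f;
}

void run(float* host_u, int n, int iters) {
    float* u;
    cudaMalloc(&u, n * sizeof(float));
    // One transfer in...
    cudaMemcpy(u, host_u, n * sizeof(float), cudaMemcpyHostToDevice);
    for (int t = 0; t < iters; ++t)
        step<<<(n + 255) / 256, 256>>>(u, n);   // ...data stays resident...
    // ...and one transfer out, instead of two copies every iteration.
    cudaMemcpy(host_u, u, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(u);
}
```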
A: Naturally, you have to spend a lot of time, or at least some time, making that true, and that's a big part of the GPU porting process. So don't go in and immediately start writing CUDA code or OpenACC code or anything like that; first identify the parts of your application that could benefit from parallelism, and whether the parallelism is exposed in that way, and you should always use profiling tools to help you with this.
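(One way to do that with NVIDIA's tools, offered here as an illustration rather than the speaker's specific recipe: annotate the phases of your code with NVTX ranges so a timeline profiler such as Nsight Systems can show where the time actually goes. The phase functions are hypothetical stand-ins.)

```cuda
#include <nvtx3/nvToolsExt.h>   // NVTX v3, ships with recent CUDA toolkits

// Hypothetical application phases, stubbed for the sketch.
void simulate_timestep() { /* ... GPU kernels ... */ }
void write_output()      { /* ... serial I/O ... */ }

int main() {
    for (int t = 0; t < 100; ++t) {
        nvtxRangePushA("timestep");  // named range on the profiler timeline
        simulate_timestep();
        nvtxRangePop();
    }
    nvtxRangePushA("I/O");
    write_output();
    nvtxRangePop();
    return 0;
}
```

Running the binary under "nsys profile" then attributes wall-clock time to each labeled range, which tells you where to spend your porting effort.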
A: It is generally the case that you will hear lots of received wisdom about what works and what doesn't on GPUs, and I encourage you not to try to live by that wisdom, but instead to experiment yourself and then use tools like profilers to help you understand whether you're succeeding or not. So with that, I'd like to thank you, and I'm open for any questions.
B: Okay, I suppose a great question I have is... sorry, do we have an example one? So there's a question here from an attendee saying: can you give some examples of codes that are not appropriate for GPUs?
A: Well, the most common example is one where you just don't have a large enough problem to solve. Like I said, GPUs require problems that have many degrees of freedom exposed; you have to be able to utilize hundreds of thousands of threads at once. That means your problem needs to be able to expose hundreds of thousands of independent pieces of work that can be solved simultaneously. If your problem just doesn't express itself in that way, say your main application is solving matrix multiplications that are 10 by 10 and you're only doing one of them at a time, that's not appropriate for running on a GPU; it'll actually be faster to run that on a CPU.
A: Now, sometimes you can express your code in a way where you can batch operations so that many are being solved at once, and then maybe you have some chance of using the GPU effectively. But really, the main pitfall is when your code just doesn't have enough work to do to saturate the GPU, and then you'll actually end up worse off than if you just used the CPU alone.
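(The batching idea the speaker mentions is exactly what cuBLAS's batched interfaces are for. A minimal sketch, assuming CUDA and cuBLAS, that launches 100,000 of those 10x10 multiplications in one call; the sizes are illustrative and the inputs are left unfilled.)

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// One strided-batched call multiplies many tiny matrices at once,
// giving the GPU enough independent work to stay busy.
// Compile with: nvcc batch.cu -lcublas
int main() {
    const int m = 10, batch = 100000;          // 100k independent 10x10 GEMMs
    size_t bytes = (size_t)m * m * batch * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);
    // (fill A and B with real data here)

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, m, m, &alpha,
                              A, m, (long long)m * m,   // stride to next A
                              B, m, (long long)m * m,   // stride to next B
                              &beta,
                              C, m, (long long)m * m,   // stride to next C
                              batch);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```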
B: Thank you for that. Another question here from Konstantinos: do GPU profilers work well when dealing with multiple MPI tasks per GPU?
A: That will depend on which profiling tool you use. The NVIDIA-provided profiling tools certainly have the capability to work when using multiple MPI tasks per GPU. The ones that are currently available don't have the greatest ability to combine that in an MPI-aware way, to show you, for example, multiple ranks running at the same time; that's something we're working on that we don't have today. But they certainly can run, and you can analyze each rank independently. And if you look broader than the NVIDIA-provided products, at some of the third-party tools like Score-P, Vampir, TotalView, or other well-known third-party profiling and debugging tools, they generally have been built to run well at scale, so those are also good things to look at.
B: Great. And I think we have time for one last question; any further questions, please put them in the Q&A, but we'll have to answer them off the chat here. So this one comes from Akil.
A: Well, the NVLink connection is a proprietary interconnect that NVIDIA created that allows GPUs to talk directly to each other at very high bandwidth compared to the standard interconnect between chips on a motherboard, for example PCIe. 600 gigabytes per second is the total bidirectional throughput at which a GPU can talk to all the other GPUs in the system, but in reality that throughput is going to be divided across multiple GPUs. There are actually 12 of these links per A100 GPU, and then you split them up across whatever interconnect topology you're using.
A: So, for example, it'll look a little bit different on the Perlmutter motherboards than on other systems where there are more or fewer GPUs per node, but that 600 gigabytes per second is the total bidirectional throughput from any one GPU to all of the other GPUs in the system, and it just gets divided across the other GPUs.
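(For what that looks like from code, a minimal sketch assuming CUDA and at least two GPUs in the node: a direct device-to-device copy that, with peer access enabled on NVLink-connected GPUs, travels over NVLink instead of bouncing through host memory over PCIe.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 reach GPU 1?
    printf("GPU 0 -> GPU 1 peer access: %s\n", canAccess ? "yes" : "no");

    size_t bytes = 1 << 28;   // 256 MB test buffer
    float *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(0, 0);

    // Direct device-to-device copy; rides the GPU fabric when peered.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1); cudaSetDevice(0); cudaFree(buf0);
    return 0;
}
```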
A: I can't really talk about their energy cost; it's not really something I'm prepared to meaningfully speak about. But this will be essentially the default way that you communicate between GPUs on Perlmutter, and so it's something to be aware of: you have much higher bandwidth than you might be familiar with using standard PCIe.