Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
C++ standard parallelism was introduced in C++17. It recently saw an extension in C++20, and more features are in the pipeline for the future C++23 and onward. As Jeff mentioned in our overview talk, this has been a very important feature in the standard, and it has been an investment over many years.
I would like to briefly mention some of the classic algorithms, such as sort, reduce, transform, and for_each; most are pretty self-explanatory. If you have not seen them, or are not too familiar with the algorithms of the C++ standard library: usually they take iterators and then do whatever the function name says, like sort. for_each is going to help us a lot in this case, because I think it is an easy way to visualize parallelism. for_each applies a function, given as an argument, to each element in the range defined by the iterators. You can think about it as if I handed out different elements of my container without any control over the order in which they execute.
If you use the parallel execution policy, how would I need to structure my algorithm, which is typically a lambda in this case, or a function object? It has to abide by the rules of good parallelism, as Brent mentioned, such as no data races, and that's just something to keep in mind as we go forward. Brent mentioned this as well, but I'll repeat that NVC++ standard parallelism also uses CUDA Unified Memory for memory management, meaning it is handled by the compiler.
Okay, if you are a C++ developer, you may have used cppreference in the past, or maybe every day. I pulled a screenshot to show that, after this talk, if you want to explore the options that stdpar enables, you look for declarations that take a universal reference to an execution policy as a parameter. You can see I boxed this one out, and you can also see that it is marked "since C++17", because that is the version of the standard it was introduced in.
Nvidia does not own this content, so it's not a replacement for the actual NVC++ documentation or anything. Below is a simple, semi-complete example with for_each using the parallel execution policy: we have a vector, and we are performing an algorithm on each element in that vector, which will bring the work and the data over to the GPU for execution and then bring it back to the CPU after it's done. Just a note for the future, when you want to give this a try: you'll want to #include <algorithm> and #include <execution>.
Just a few details to help you get started. When using the parallel execution policy, make sure there are no data races or deadlocks, as Brent mentioned, because this is left up to the programmer to handle. When you declare std::execution::par or std::execution::par_unseq, the two parallel options.
You are saying that there is no dependency between iterations, and the compiler just trusts you. And, as we mentioned, stdpar uses CUDA Unified Memory to handle the data transfers between CPU and GPU for us. Just a note that Unified Memory requires data to reside in heap memory: this means std::vector is all good, but std::array would not be, because it resides on the stack. And two quick notes: functions referenced do not need to be given the __device__ annotation like they are in CUDA C++.
If you were familiar with that: the compiler just automatically goes through the call stack and handles this for you, like it does the data. Also, execution on the GPU requires random access iterators, not forward iterators. And to compile using stdpar with our NVC++ compiler, use the -stdpar flag, and we have two options with that: you can use -stdpar=gpu, which is the default, or -stdpar=multicore.
And this is a very simple workflow that we can walk through. Imagine that we have a problem: there's a vector I want to sort. You can see we have a vector called vec1, and just for this example it has 10 unsorted ints. Our solution is simply to apply the standard algorithm std::sort to the vector, and you can see that we're all sorted out below. stdpar comes into play on the far right, as a potential performance improvement, very easily.
We add std::execution::par to our function call, and then later we remember to compile with the -stdpar flag, and that's pretty much all you need to get this code and this data onto the GPU. I have noted "potential" improvement here, because I will say that if you give a vector of 10 ints to a GPU and that's it, you won't exactly be breaking light speed.
If you think about it, the speedup would need to be at least worth the cost of the data transfer, as Brent touched on a little bit. But fortunately, we can just apply the same basic knowledge that Brent provided, and that we'll touch on throughout these two days, of what makes a good GPU program, and make sure our code follows those same guidelines. Just generally, there needs to be plenty of work and data to keep the GPU happy, utilizing the hardware, in order to see a performance increase.
And Jeff already gave a great overview of the cool work that the professor has done with the lattice Boltzmann simulations using stdpar. I just wanted to advertise it one more time, because the GTC talks are really good and I reference them quite a bit, and they're just down here at the bottom.