From YouTube: 4 Accelerating code with Directives
Description
Part 4 from the Parallelware Trainer Tool workshop at NERSC on June 6, 2019. Slides are available at https://www.nersc.gov/users/training/events/parallelware-tool-workshop-june-6-2019/.
Okay, so you have learned by examples. Those of you who had never seen OpenMP or OpenACC before are now familiar with the syntax, with some of the fundamental pragmas and some of the fundamental clauses. But there are some additional pragmas and clauses that you can use during the practicals. So here I will try to introduce the additional pragmas and clauses that you will be able to use in the practicals.
Okay, parallel: really, nothing else to explain. Remember, parallel is where you define the parallel region. Until that point, single-threaded code; after the parallel region, single-threaded code again. So at the beginning of the parallel region, threads are created and all of them work in parallel; at the end of the parallel region, all of them are destroyed except for one, which continues the single-threaded execution. Okay. So this is essentially what we are explaining here, and this is the syntax in C/C++, so I think it doesn't make sense to stop here any longer.
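To make the lifecycle above concrete, here is a minimal C sketch of a parallel region (the helper name `run_region` and the counter are ours, not from the slides; if the code is compiled without OpenMP support, the pragmas are simply ignored and one thread executes the block):

```c
/* Counts how many threads executed the parallel region. */
int region_executions = 0;

void run_region(void) {
    /* Single-threaded code runs up to this point. A team of threads
       is created here, and every thread executes the block below:
       without work-sharing, the body is replicated, not divided. */
    #pragma omp parallel
    {
        /* Atomic update so concurrent threads do not lose counts. */
        #pragma omp atomic
        region_executions++;
    }
    /* The team is destroyed here; one thread continues alone. */
}
```

After calling `run_region()`, the counter equals the number of threads in the team (one per thread), which illustrates that the block is executed once by each thread.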
What pragma would give me more performance, kernels or parallel? In some sense, both of them specify the beginning and the end of a parallel region. So what's the difference? The difference is that with parallel, the one we have been using and the one that I will really suggest or recommend you to use, you, the developers, are responsible for using the pragmas and the clauses correctly, following best practices. If we don't use them properly, the parallel code that we create will be incorrect. Okay, it's our responsibility.
Kernels is an attempt by OpenACC to release the programmer from that responsibility. So who discovers the parallelism, if it's not the programmer using a pattern-based approach or a classical approach of trying and testing different parallel implementations? Who does the discovery of parallelism here? Who implements, who generates the parallel code for us? There is only one piece of software that can do this: the OpenACC compiler. Okay, the OpenACC compiler is a compiler like any other, and any compiler has some capabilities to discover parallelism in real code.
But all of these things that we use in every single code defeat it: they make the classical dependence analysis and data-flow analysis ineffective. It doesn't work to discover this parallelism. Okay, but again, if that technology somehow improved, then kernels would be a way for us to get rid of the responsibility of finding and discovering the parallelism and implementing the parallel version. But the reality is that the state-of-the-art compilers today are not very effective at discovering parallelism in real code.
Okay, indeed, in the example that we have in the practical you can check kernels, and you will see that no compiler can discover the parallelism that parallel, with our approach, can discover, because we are using a completely different way of discovering parallelism. Okay, it is the intellectual property of the company that we incorporated five years ago.
So, but anyway, it's important to know that this exists, so that at some point you can even try and test the difference in performance between the parallel pragma, using the pattern-based approach, and kernels, to see how far a compiler can get in doing this job for us. Okay, indeed, some of the practicals we proposed can explore this part of comparing the performance that you get with the parallel and with the kernels directives. Okay, so it's good that you know that this exists, and this is the syntax of the various pragmas.
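To show the contrast between the two directives, here is a minimal sketch using a saxpy loop (the saxpy example and function names are our choice, not from the slides; without an OpenACC compiler the pragmas are ignored and both versions run sequentially, which keeps the sketch portable):

```c
/* With `parallel loop` the programmer asserts: this loop is parallel.
   Correctness is the programmer's responsibility. */
void saxpy_parallel(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* With `kernels` the compiler must analyze the region itself and
   parallelize only what it can prove safe; if its dependence
   analysis fails, the loop stays sequential. */
void saxpy_kernels(int n, float a, const float *x, float *y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Both compute the same result; the difference is who takes responsibility for discovering the parallelism.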
If we don't specify work-sharing, then our loop iterations are replicated in each of the threads: ten iterations on one thread means ten iterations in the execution; ten iterations on two threads means ten plus ten, twenty iterations executed; on twenty threads, twenty times ten iterations. But we don't want the replication of our code. When we go to a parallel execution, we want to divide the workload among the threads, not to multiply the workload by the number of threads. Okay, so work-sharing is essential to specify parallelism and to really have a parallel version.
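The replication-versus-division point can be sketched like this (the helper names are ours; compiled without OpenMP, both run on one thread and return `n`):

```c
/* No work-sharing: every thread in the team executes all n
   iterations, so the total work is n times the number of threads. */
int run_replicated(int n) {
    int count = 0;
    #pragma omp parallel
    {
        for (int i = 0; i < n; i++) {
            #pragma omp atomic
            count++;
        }
    }
    return count;  /* n * number_of_threads */
}

/* Work-sharing: the n iterations are divided among the threads,
   so they are executed exactly n times in total. */
int run_shared(int n) {
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        count++;
    }
    return count;  /* exactly n */
}
```

With ten threads, `run_replicated(10)` does a hundred increments while `run_shared(10)` does ten: division of the workload, not multiplication.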
Okay, three levels of parallelism. Work-sharing on the CPU is simple to understand. I mean, one thread begins the execution, it finds the parallel region, it creates ten threads; I have fifty iterations, so fifty iterations divided between ten threads means each thread is assigned five iterations, and all of the threads can communicate and synchronize with all of the other threads. So the execution model of the multi-threaded CPU is simple to use.
Okay, but that is not the case for the GPU. Remember, when we began this morning, we said that the GPU has a complex memory design, a hierarchy of memories. In the CPU you have the main memory and the cache; in the GPU you have the main memory, the shared memory, the cache, the scratchpads, different types of memory, and not all threads can access all of the memories. That's the main difference with a multi-threaded CPU. So there are restrictions that are imposed by the hardware.
A
So
how
do
we
as
programmers
can
lead
with
this
complexity?
Ok,
open,
MP
and
open,
as
you
see,
provide
a
way
to
handle
this.
That
is
when
you
do
work
sharing.
You
can
specify
work
sharing
at
three
levels.
Let's
call
it
generically
coarse
grain,
fine
grain
scene,
director
called
grain,
means
that
when
all
the
threads
are
created,
imagine
100
threads.
These
threads
are
grouped
by
groups.
Imagine
that
the
groups
are
of
50,
so
two
groups
of
50.
What this means is that each group has a representative thread that can communicate with the representative of the other group, but not with the other threads of the other group. Okay, so each of these gang threads can communicate with its workers, using OpenACC terminology, and can communicate with other gangs, but not with the workers of other gangs.
Okay, so this is the GPU execution model, and the OpenACC and OpenMP execution models for GPUs provide this functionality to somehow simplify the control of how the threads are grouped on the GPU automatically by the hardware. Okay, so in OpenACC we have a clause that is called gang. What gang means is that when you specify a reduction, you can say: I want to make a reduction at the gang level.
What this means is that the gangs, each of the groups, will collaborate, will communicate with each other to make the reduction of all the local partial results computed in each of the gangs. So you can do a reduction between gangs. If you specify a reduction at the worker level, you will not get the correct result that you expect. Why? Because at the worker level, you will have the workers within this gang making a reduction, and the workers within that gang making a reduction, but the gangs will not communicate to make the final reduction.
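A minimal sketch of a gang-level reduction in OpenACC (the function name is ours; without an OpenACC compiler the pragma is ignored and the loop runs sequentially, still producing the correct sum):

```c
/* Sum reduction at the gang level: each gang accumulates a partial
   sum, and the reduction clause makes the gangs combine those
   partial sums at the end. Workers of one gang cannot communicate
   with workers of another gang, so this cross-gang combination is
   exactly what the reduction clause has to provide. */
double sum_gang(int n, const double *v) {
    double total = 0.0;
    #pragma acc parallel loop gang reduction(+:total)
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}
```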
The workers will not use atomicity and mutual exclusion across gangs to make the reduction correctly, okay. So even these levels of parallelism, which you usually use to optimize the performance of your application on the GPU, can lead to incorrect code that produces incorrect numerical results. For instance, reduction operations: reduction operations are defined in OpenACC to work only at the gang level, not at the worker level, not at the vector level. In OpenMP we have an equivalent. We have again three levels: the coarse-grained level is specified by teams distribute, and the worker level is specified by parallel for.
Okay, and an additional level, the vector level, is specified by simd. Within each of these, the threads are somehow tied to each other so that all of them are used in the different lanes of the vector hardware. This happens on the GPU using vector, and this also happens on the CPU if you take multiple threads and you vectorize some inner loops within the multi-threaded code. Okay, so that is the importance of the three levels of parallelism for the GPU.
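The three OpenMP levels can be combined in a single directive, as in this sketch (the function name is ours; when OpenMP offload is not enabled, the pragma is ignored and the loop runs sequentially):

```c
/* The three OpenMP levels in one construct:
   - teams distribute  -> coarse grain (like OpenACC gang)
   - parallel for      -> fine grain   (like OpenACC worker)
   - simd              -> vector lanes (like OpenACC vector)
   The map clause transfers v to the device and back. */
void scale(int n, double a, double *v) {
    #pragma omp target teams distribute parallel for simd map(tofrom: v[0:n])
    for (int i = 0; i < n; i++)
        v[i] *= a;
}
```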
Just to summarize: you need to remember that this exists. You need to remember that when you are doing, for instance, reduction operations, you can only make reductions at the gang level. If you do it at lower levels, the result will not be as expected, because all the threads cannot communicate with the rest of the threads; the result will be incorrect. Okay, so this has an impact on performance but, more importantly, even on correctness. So we need to be aware of this.
Any questions on all this? Just be aware of this when we move on to the practical. Okay, atomic: we have already seen atomic. We have atomic available for C/C++ in OpenMP, and we also have atomic available in OpenACC. So, the parallel loop with atomic protection to implement reductions: we can use that strategy to execute reduction operations in parallel on the GPU, and atomic operations on the GPU are now extremely effective. There has been a great improvement in the hardware support; some years ago they were very costly.
You can now expect good performance: right now the atomic operations of the GPU are really highly optimized. So this is something that we can use to create parallel code on the GPU using atomic protection. Okay, so you can do it and you can play with it in the practicals.
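A minimal sketch of the parallel-loop-with-atomic-protection strategy (the example and function name are ours; compiled without OpenMP, the pragmas are ignored and the loop runs sequentially with the same result):

```c
/* Reduction implemented with atomic protection instead of a
   reduction clause: each update of the shared accumulator is an
   atomic operation, so concurrent threads cannot lose updates. */
long count_even(int n, const int *v) {
    long count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```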
Target and data: we have already seen this. Remember that when you go to the GPU, the GPU execution model of offloading is host-driven. The host starts the execution, and at some point it decides that part of the code is offloaded to the GPU. The code to be executed is sent, but we also need to send the data that is needed to make the computations. This is done with data in OpenACC and with target data in OpenMP, and then we need ways to control the data transfers from the CPU memory to the GPU memory. This is copyin in OpenACC and map(to) in OpenMP, to copy data in.
Once the result is computed on the GPU, we transfer the data back to the CPU so that we can see the output; remember that the execution is host-driven. So, copyout, or map(from). And there is data that you will probably want to copy both in and out for some reason in your application, so copy, or map(tofrom). Okay, and you can see that Parallelware Trainer generates copyin, copyout, copy and map clauses for you to handle the data.
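The host-driven model with explicit data movement looks roughly like this in OpenMP (the function name is ours; `map(to:)` corresponds to OpenACC copyin and `map(from:)` to copyout; without OpenMP offload the pragmas are ignored and the loop runs on the host):

```c
/* Host-driven offload with explicit data movement: the input array
   is copied to the device before the region (map(to:) / OpenACC
   copyin), and the result is copied back after it (map(from:) /
   OpenACC copyout). */
void square_all(int n, const double *in, double *out) {
    #pragma omp target data map(to: in[0:n]) map(from: out[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            out[i] = in[i] * in[i];
    }
}
```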
Imagine that you have an array of one million elements, and you have a loop that processes 1000 of those elements, and you want to offload those computations to the GPU. Would you transfer the one million elements to the GPU? No, you would want to transfer only the region of the array that is really used during the computation.
So you need to specify somehow that, from an array of one million elements, only these 1000 elements, starting here and ending here, are what needs to be transferred from the CPU memory to the GPU memory. Okay, this is what we call array shaping. We have shaping for 1D arrays, 2D arrays, 3D arrays, multi-dimensional arrays, and the way we specify this uses the same syntax that we use to allocate arrays statically in memory.
You see, we can write float x[1000], and this statically allocates an array in memory. In Fortran you can also create arrays using, I think it is, the parentheses notation. Okay, so essentially it is the same notation, and you specify where the elements start and how many elements you want to transfer. Okay, and this is essential to minimize the data transfers from the CPU to the GPU and back from the GPU to the CPU.
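A minimal sketch of array shaping in OpenACC (the function name is ours; in C the section is written `start:count`; without an OpenACC compiler the pragma is ignored and the loop runs sequentially):

```c
/* Array shaping: from a possibly huge array, transfer only the
   `count` elements beginning at `start`, not the whole array.
   copyin(data[start:count]) uses the start:count section notation. */
double sum_section(const double *data, int start, int count) {
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(data[start:count])
    for (int i = start; i < start + count; i++)
        total += data[i];
    return total;
}
```

So for the example above, an array of one million elements with only 1000 used would be shaped as `copyin(data[0:1000])` instead of transferring the full array.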
Finally, only two slides remaining. Remember that we said that we want to learn how to parallelize real code, and your code has loops that call routines. If we cannot parallelize a loop that contains a call to a routine, we're in trouble. Okay, so we need somehow a directive where we can mark which routines need to be executed on the GPU.
Remember that when you compile code for the CPU, the binary code runs on the CPU architecture, but the GPU has a different architecture, so we need to compile a different binary version of that routine to be run on the GPU as well. So how do we specify this in OpenMP and OpenACC? If my loop, as you will see in the LULESHmk practical, calls functions, I need to say: this function and this function will be offloaded to the GPU; please, compiler, generate a binary version to be executed on the CPU and also another binary version to be executed on the GPU, because both are needed: when the code is not offloaded, all the code is executed on the CPU. So this is what routine does in OpenACC, and this is what declare target does in OpenMP.
So imagine that you have a fully parallel loop, a parallel for, that calls foo. You need to specify that foo will be offloaded, so that the binary is generated for the GPU.
So in the declaration of the foo function, before the signature, you have #pragma acc routine, and here you have several modifiers; we will use seq only for these practicals. With this, you guarantee that the compiler will generate a version of foo that will run on the GPU whenever it is needed to offload it, okay.
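Putting that together, a minimal sketch (`foo` is the name used on the slide; the body of `foo` and the wrapper `apply_foo` are our invention; without an OpenACC compiler the pragmas are ignored and everything runs on the CPU):

```c
/* `routine seq` tells the compiler to also emit a sequential device
   version of foo, so the offloaded loop below is allowed to call it.
   Both a CPU binary and a GPU binary of foo are generated. */
#pragma acc routine seq
double foo(double x) {
    return 2.0 * x + 1.0;  /* placeholder body for the sketch */
}

/* A fully parallel loop that calls the routine. */
void apply_foo(int n, double *v) {
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; i++)
        v[i] = foo(v[i]);
}
```

In OpenMP the equivalent would be to enclose foo's declaration in declare target / end declare target.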
Something that people usually do is inline the routine. What this means is that, instead of using routine seq, you take all the body of the routine and you replace the call with the body of the function. Can you do this? Yes, you can. Do you avoid the call to the routine? Yes, you avoid it, but you make your code less structured. You are going against writing structured code, the well-structured code that makes your code maintainable. So it's better practice to use this directive instead of inlining the routine wherever it is called. Okay, you will have to play with this in the LULESHmk practical. And finally, those of you who work with C and C++ know restrict and const.
They are not always explicitly needed to generate parallel versions, parallel code, but some compilers may request that, for some of the arguments that you have in the signature of a function, when there are pointers, you explicitly mark that those pointers cannot alias one another: that the regions of memory that can be accessed by dereferencing those pointers cannot overlap. Okay, so those of you who are familiar with pointers in C and C++...
...you will probably find restrict useful. In Fortran you usually don't address these issues, because in Fortran, by default, when you allocate arrays they are allocated in separate memory regions that cannot overlap, and apart from that, when you use pointers, you have a restricted implementation of pointers that is not as powerful as this one. But again, this makes programming easier, and writing compilers easier, in some sense. Okay, so this is essentially something that you will probably need for C and C++ real codes.
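A minimal sketch of the two qualifiers together (the function name is ours; the key point is the promise the signature makes to the compiler):

```c
/* `restrict` promises the compiler that a, b and c do not alias:
   the memory regions reachable through them cannot overlap.
   `const` promises that the inputs are read-only. Both promises
   make it easier for the compiler to vectorize or parallelize
   the loop safely. */
void vec_add(int n, const double *restrict a,
             const double *restrict b, double *restrict c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```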
Okay, so we have omitted the heat practical from this course here, but I think you have the heat example included in the participants' materials that Helen has shared with all of you. So again, heat has some combination of patterns and loops, so you can play with all of the patterns that you have seen here. But for this afternoon I will recommend that, instead of doing the heat practical, you really go into the complexity of the LULESHmk practical.
A
So
we
can
help
you
to
understand
the
complexity
of
paralyzing
real
codes
and
how
the
composition
in
components
and
the
composition
impetus
can
really
help
you
to
understand
how
to
paralyze
real
codes,
even
if
it
is
the
first
time
that
you
see
the
code,
you
need
to
understand
the
science
behind
the
code.
You
will
need
to
find
properties
in
the
code
and
these
properties
are
remarket
captured
by
the
patterns
themselves.
So
that's
what
you
really
need
to
change
in
your
menses.
Okay,.