From YouTube: 3 Parallel Patterns
Description
Part 3 from the Parallelware Trainer Tool workshop at NERSC on June 6, 2019. Slides are available at https://www.nersc.gov/users/training/events/parallelware-tool-workshop-june-6-2019/.
The break is finished, so we need to continue with the agenda of this Parallelware Trainer tool workshop. Before the break we had a really great session: we had so many questions and so much interaction from you that I really hope we keep up this great learning atmosphere. Essentially, what we introduced, as you remember, were the very basic concepts.

These are the minimum concepts we need to understand what we are doing when we go to the GPU. We saw the key differences between the CPU and the GPU, and we also saw examples of what OpenACC and OpenMP code looks like for a multi-threaded CPU and for a GPU. We did this without formally introducing the semantics of the pragmas; we somehow learned and got used to the syntax through the examples we showed in the demonstration of the tool, using the pi calculation example.

I'm very happy, because we also covered many of the features that are available in the tool, and we also discussed many interaction issues that might appear with the file system, with other tools, with your workflow. So it was really a great session. Now we are going to jump into the new key concepts, the distinguishing part of this training: learning the families of parallel patterns that we are considering right now, learning what different parallelization strategies we have for each of these patterns, and where these strategies can be applied — CPUs, GPUs, multi-threading, offloading, tasking, in OpenMP and OpenACC.

So first I would like to begin by thinking about what OpenMP and OpenACC really do for us. These are texts extracted from the OpenACC specification, and I really like to highlight this part, because more or less everything is a consequence of it: programmers need to be very careful that the program uses appropriate synchronization.

What this means in practice is that OpenMP and OpenACC don't guarantee that the code is correct. It is us as programmers who have to guarantee that the pragmas and clauses we add to the code will behave correctly in parallel. OpenMP and OpenACC provide great support from the compiler, which does all the hardest work of generating the calls to the multi-threading library of the operating system.
That is the hard part of the work, and the compiler automates it, but we are still responsible for selecting and writing the appropriate pragmas, the appropriate clauses and the appropriate options to those clauses. If we make a mistake in one of them, the program may be incorrect. And if the program is incorrect, don't blame the compiler, don't blame the standard, don't blame the machine: it is we who wrote incorrect code. Okay, so OpenMP and OpenACC, as you see, make programmers responsible for making good use of the pragmas and their related clauses.

So we need to learn best practices for how to use the pragmas and the clauses. We have reference guides, quick-start guides, great tutorials that explain in a lot of detail the semantics of each pragma, each clause, each option, and what the differences are between C, C++ and Fortran and between different compilers. But what is really missing in all that material is how we relate them: how do we combine all of these bricks in the best way for our code? They can be combined in many different ways — which is the best way, and why? This is what the patterns will give us: the knowledge to decide. They guide us in the decision of which pragmas and clauses we should use and how we should combine them for our specific code. Okay, so the decomposition of the application into patterns will help us make good use of OpenMP and OpenACC and will speed up the parallelization process.

This is important because, if we don't invest time in understanding this, the alternative approach is trial and error. I write "parallel for", build, run — it runs correctly, so I think I'm good, my code is correct. But maybe it is only correct for that run; maybe another run produces incorrect results, or maybe I change my input data to a different problem and my program, which was correct for a given data set, now behaves incorrectly. And again it is my responsibility to do it correctly.

So it really does speed up the parallelization process, because otherwise we invest a lot of time trying to debug, trying to find the bugs in incorrectly written parallel code and fix them. The question is: can we avoid making the most common mistakes when we first add pragmas and clauses, save that development time and use it for other purposes? Can we do that? The patterns will help you with that, and the patterns are also based on best practices for parallel programming.

The parallelization strategies that we are supporting right now, which we will see in these slides, are based on the analysis of, for instance, the CORAL benchmarks. We have analyzed all the implementations of the CORAL benchmarks. We published a scientific paper — it can be downloaded; it was presented at Supercomputing two years ago — where we analyzed all of the implementations together with collaborators from national laboratories, and we came to the conclusion that the three implementations we are supporting right now are the most widely used implementations of the patterns we support.
Can you implement them in different ways? Yes, you can, but we tried to imitate and promote the practice of what expert developers did when they developed the CORAL benchmarks, and to incorporate that knowledge into a tool so that you can learn from it and apply it to your codes. Okay, so based on that, it is likely that if you choose the right implementation for a given pattern, you will get good performance. I don't say peak performance, because peak performance is always very hard, but good performance.

We said this morning that real codes are large and complex, so we need to approach real codes from a different perspective. Let's use components for that, and let me talk in terms of components. We start from the serial code, and we need to analyze it component by component. By components I mean different types of components. One type is scientific components: we always try to avoid reinventing the wheel.

So we just have to link them when building the executable, and the executable will make very efficient use of these scientific components. Okay, but that is not enough. If my problem were only computing one FFT, my problem would be solved just by calling the FFT, but usually an FFT is only a step within a more complex simulation: I compute the FFT, I take the result, I compute a matrix multiplication, I take the result, and somehow I manipulate these results to compute my output. These are just steps in big scientific applications.

So at some point in our code, what we need to do for our science will not be available as a call to a scientific library, and we will need to analyze real code. We need to write and understand the code: how it uses all the outputs from the scientific components, and how all these outputs, all these variables, are combined to produce my result. Okay, so I cannot escape from analyzing what we call code components, or patterns: the actual code that we write.

Even if there are libraries highly optimized for a system that you can use, you will still find sections of code that are not available in the libraries. Those are the ones you will need to understand in terms of parallelism — to discover the parallelism in the code — and turn into a parallel implementation. For that, once you identify the pattern, the pattern will guide you to create the parallel version.

We saw that with the pi example: a scalar reduction can be parallelized as a parallel scalar reduction, and we can implement this parallel scalar reduction in many different ways — we saw at least four or five implementations in the demo before the break. So how many parallel implementations can I have? As many as you can imagine, just combining all the elements that OpenMP and OpenACC give you. Okay, so for the code patterns you will need to generate parallel code; that is the final step.
But here you have many possibilities that you need to compare in order to select the best one. So how does this process fit in the workflow we saw for OpenMP? Remember that we said that, for real codes, we need to profile first. Why? Because if I have one million lines of code, it doesn't make sense to start at line number one and go up to line number one million; it is better to focus on the part that consumes most of the execution time, especially if I go to the GPU.

You remember that we said you need to minimize data transfers, but you also need a significant workload to offload to the GPU, to take advantage of the beast. The GPU has a huge computational power, so you need big problem sizes to feed that beast so that it can really compute fast for you. Okay, so begin with the hotspots.

Then we said that we had these two steps: find the hotspots, those parts of the code that need to be analyzed for parallelism; decide how to implement them in parallel; and make the actual parallel implementation. It is in these two steps that the pattern / parallel pattern / parallel code flow fits into the overall workflow. So you will work iteratively within the general workflow, working on different loops, incrementally adding more and more parallelism to your code.

Components, patterns, patterns translated into parallel code — more or less we have already said all this, but I want to summarize it in a set of four steps. First, we are talking about taking a real application, not the toy examples that we use in training: an application that is part of your science. First, even if you have never done it, do at least one profiling run, just to double-check that you are focusing on the right part.
Second, for each routine contained in an external library, what do you have to do? Remember that you want to run your code on a given platform. You may be using a generic library that was compiled for a laptop and that can be ported to Cori, but probably on Cori we have installed a highly optimized version of the same library that you are using on your laptop.

So you need to consider identifying the scientific components that you have in your code — the FFTs, matrix multiplications, solvers, spectral methods — and consider using the highly optimized versions that you have on a given system. That way you can take advantage of all the work that the staff of the center has done for you. Okay, and that leads to step number three.

Third, you just have to be aware of whether you are coding a routine that is already available as a library and you were not aware of it, or you had decided not to use it, because you need to take a decision: do I want to keep using my own code, or do I want to replace this piece of code with one single library call that is highly tuned for the system? Okay, it's up to you.

What is better for your science or for your code, given your expertise? But you need to be aware of it and make the appropriate decisions. Okay, so in that case you can consider replacing the corresponding routines with optimized library calls available on the system where you are going to run. And finally, for the remaining user-defined routines — the routines that you have coded yourself as a developer — you need to address the complex process of parallelizing your code. For this, what we propose is to decompose your code into components, in particular into code components, into patterns, and to use what we will see next as a guide to generate different parallel versions, so that you can pick the one that performs best, fastest, on a given architecture. Okay, and this is the final step, number four: for the remaining user-defined routines, understand the code compute patterns that you have in your code. Okay.
In the Parallelware technology we have probably eight, nine, ten patterns, but some of them are rarely found in scientific codes; if we have them it is because they appear in some domains, but they are not of general use. So what we have done with Parallelware Trainer is to provide support for those code patterns that are most widely used, and we use this terminology: parallel forall, parallel scalar reduction, and so on.

The parallel forall is what intuition says a forall is: typically a loop where all the iterations can be executed concurrently, in parallel, in any order; you don't need to worry about dependencies or about the ordering of iterations. Okay, so this is what parallel forall means. We can easily represent it as a loop where each iteration produces a new value that is stored in a different element of an array. This is typical in scientific and numerical computation.
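For reference, here is a minimal sketch of what a forall loop can look like (the function and variable names are illustrative, not taken from the slides): every iteration writes a different element of the output array, so the iterations are independent and can run in any order.

    /* Forall pattern: iteration i only touches out[i]. */
    void scale(int n, const double *in, double *out, double factor) {
        for (int i = 0; i < n; i++) {
            out[i] = factor * in[i];
        }
    }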
Okay, so this is the typical, simple code that you will find when you recognize a parallel forall. The next pattern is the scalar reduction. Now you have a loop where all the iterations compute a value — but what do we do with those values? We don't produce different, independent output values in each iteration; what we do is reduce them all to one single value, using a sum operator, a multiplication operator, a minimum or maximum computation. There are very well known types of reduction operations. So this is typically represented like this: we have the values, as we have here.
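A hedged sketch of the pattern, using the same pi calculation discussed in the session (the exact code on the slides may differ slightly): every iteration computes a contribution and all of them are combined into the single scalar sum.

    /* Scalar reduction pattern: sum is the reduction variable. */
    double compute_pi(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * dx;     /* declared inside the loop body */
            sum += 4.0 / (1.0 + x * x);    /* every iteration adds into sum */
        }
        return sum * dx;
    }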
Instead of producing a different independent element in each iteration, we reduce them all through this reduction operator. Okay. The sparse reduction is a reduction again — we have a set of values and these values are reduced — but instead of reducing all of them to one single scalar value, we reduce them to a set of values.

Okay, so why is it called sparse? Because which elements are updated depends on something that we don't know until runtime. Do you have experience with finite element codes, with molecular dynamics codes? It is typical in finite elements or molecular dynamics to find this type of code. You iterate over elements, or you iterate over molecules, and what you do is compute the interaction between one molecule or one element and its neighbors, so you need to add that contribution for the list of neighbors: neighboring finite elements or neighboring molecules.

So how do you represent in your code the list of neighbors of a given molecule, or the neighbors of a given finite element? You typically use an indirection array, which is represented here as C. Okay, so these sparse reductions can in general be parallelized using strategies similar to the ones we use for scalar reductions, but we need to take something additional into account: the result is not a single value, it is a set of values, and we only know which elements will be updated at runtime.
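A minimal sketch of a sparse reduction, assuming an indirection array neighbor[] such as the neighbor list of a molecular dynamics or finite element code (all names are illustrative): which elements of force[] are updated is only known at runtime, through the values stored in neighbor[].

    /* Sparse reduction pattern: A[C[i]] += ... through an indirection array. */
    void accumulate(int n, const int *neighbor, const double *contrib, double *force) {
        for (int i = 0; i < n; i++) {
            force[neighbor[i]] += contrib[i];
        }
    }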
So the question is: how can we handle this in the parallelization strategy? Okay, we can do it, and we will see how. Essentially this is the use case that we are proposing in the hands-on practical that you will do after lunch, playing with it and learning how to parallelize sparse reductions, which appear in many scientific domains. And finally, we have added this last pattern recently, because we found some use cases where it appears. Essentially we have the forall computation, but now there is an indirection: every single iteration produces a different value, but two iterations can eventually compute the value of the same element — they can collide, they can conflict in producing the same element of the array. It depends on the values of C: if C is a permutation, there is no conflict; if it is not a permutation, there are potential conflicts at runtime.
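A hedged sketch of this sparse forall pattern (again with illustrative names): each iteration assigns — rather than accumulates into — an element selected through the indirection array C, so conflicts are possible whenever C is not a permutation.

    /* Sparse forall pattern: plain assignment through an indirection array. */
    void scatter(int n, const int *C, const double *val, double *A) {
        for (int i = 0; i < n; i++) {
            A[C[i]] = val[i];   /* assignment, not a reduction */
        }
    }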
Okay, so these are the four patterns that we support and recognize in the Parallelware tool, and now we will see how we can parallelize them. Just as a reminder, to reinforce the learning: a parallel forall is typically a loop that updates all of the elements of an array; typically each iteration updates a different element of the array, and the result of the computation of this pattern is an array, which is called the output. Okay. So how do we parallelize this? It is a parallel loop.

You don't need to worry about the order of the iterations; you can reorder them in the most convenient way for your purposes, because they will never have race conditions — the parallel behavior is correct by construction for a parallel loop. For the scalar reduction, what you are doing is computing multiple values and reducing them into one single value, which is called the scalar reduction variable. Important here: you cannot use just any operator.

You need to use an operator that fulfills two mathematical properties, commutativity and associativity, because this is what enables reordering. Two plus three equals three plus two mathematically, but not necessarily computationally. Okay, so you need to guarantee that the operator fulfills these two properties in order to compute the scalar reduction in parallel.
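A small illustration of that last point (not from the slides, just illustrative numbers): floating-point addition is associative mathematically but not computationally, which is why a parallel reduction that reorders the additions can produce slightly different results than the serial loop.

    #include <stdio.h>

    int main(void) {
        double a = 1.0e16, b = -1.0e16, c = 1.0;
        printf("%g\n", (a + b) + c);   /* 1 on typical IEEE-754 hardware */
        printf("%g\n", a + (b + c));   /* 0: b + c rounds back to -1e16  */
        return 0;
    }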
This is typically coded as a loop, and the result of the pattern is what is called a reduction variable, or output reduction variable. Here we have three different ways of parallelizing it, which we saw in the demonstration before the break. Essentially it is a parallel loop — the same way of coding a forall — but adding additional synchronization. Do you remember what the OpenMP and OpenACC standards say? As we saw at the beginning, the programmer is responsible for adding appropriate synchronization to guarantee correctness.

So we can generate many versions, many parallel implementations, of the same code. The next one is the sparse reduction. Remember that the key distinguishing features are that the output is an array and that it has a sparse nature — the unpredictability of the values of the indirection. So a sparse reduction combines a set of values into a set of values, again using a commutative and associative operator, but using a vector, an array, as the output, not a single scalar variable. Okay, and the set of array elements that will be updated cannot be determined until runtime. Why?

Because only at runtime do we typically know the neighbors of the molecules in a highly dynamic molecular simulation, or the neighbors of a finite element in an adaptive finite element code that changes the connections and refines the mesh. Okay, of course there are some problems where this indirection array may keep fixed values for the whole execution, and that opens different optimization opportunities, but in any case you can parallelize the sparse reduction using the same strategies. And, as always, all the code patterns have an output.

And finally the sparse forall. I will not stop long on it, because the way it behaves requires different ways of enforcing synchronization, but from the point of view of its description it is very similar to the sparse reduction: it updates the elements of an array, the set of array elements cannot be predicted at compile time — it is only known when you execute the application, for the input dataset of that particular run — and again you have an output variable that is an array. Okay.
So, if we are able to take our code, our loops, our hotspots, and characterize the loops in terms of these patterns, what do we gain? What are the benefits? Okay: patterns enable us to ensure correct variable management in the parallel code. What this means is that when you use OpenMP and OpenACC capabilities, you create the parallel region, you do the work sharing, but you have additional clauses where you have to specify, for all the variables in the code, what to do with them. Will you make them private?

Will you share them? Will you reduce them across the threads? So you need to remember that you have to specify how to manage all the variables that are read or written in your code. The patterns characterize the computations for a given variable, so they give you the information you need in order to decide the correct way to manage that variable — in particular the one that is the output of the pattern. Okay. The patterns also provide algorithmic rules to recode sequential code into a parallel equivalent.

Once we know we have a reduction, we know the statements of the code that update the reduction variable. I know that I can forget about the rest of the code from the point of view of that variable; I just need to protect the concurrent accesses in the statement that updates it. So for that variable I know how to manage it appropriately, with additional synchronization, so that the parallel execution of the code is correct.

Okay, so the pattern provides the algorithmic rules to generate parallel code, and that is why Parallelware Trainer can do it for us. Also, each pattern has a set of policies and strategies that can be applied to it, so it also supports generating different parallel versions of one single sequential code, using different standards and different hardware platforms. We saw in the dialog that we could choose OpenMP or OpenACC, GPU or CPU, multi-threading, offloading or tasking paradigms. All the combinations of these are all the parallel versions that you can generate.
These are the strategies that we have for each of the patterns — the ones that we have implemented in Parallelware Trainer — and they are inspired by best practices in parallel programming, for instance through the analysis of the CORAL benchmarks. What you can see here is that for a given pattern, the forall pattern, whether you run on the CPU or you offload to the GPU, you have one unique strategy available, which is the parallel loop.

This is simply because you don't need any additional synchronization to guarantee correctness. For the scalar reduction you can see that we have three implementations on the multi-threaded CPU and only two implementations for offloading to the GPU. In both cases we can use the built-in reduction support of OpenMP and OpenACC, or we can use the atomic protection that we achieve by protecting the statement that updates the variable; but on the CPU you can also use the explicit privatization implementation, which is not available for the GPU.

The reason for this — and we will see it next, when we cover explicit privatization — is that this strategy creates a private copy of the variable for each of the threads. On the GPU you typically have thousands of threads, so creating private copies may incur a lot of memory overhead that may make your program inefficient, or make it crash because it runs out of memory; strange things can happen in your code. And if that is already a risk for scalars, for sparse reductions it is much worse, because the private copy is a whole array.

So we can easily end up running out of memory. Best practices therefore don't recommend using explicit privatization for the sparse reduction on the GPU, and that is the reason why it is not implemented or supported in the Parallelware tool. Okay, and in the case of the sparse forall, we are working on having the explicit privatization strategy available on the CPU; it is not applicable to the GPU.

I will not go into the details of that, and the other strategies are also not applicable to the sparse forall, because the way you need to combine the partial results of the threads into the final result needs special synchronization and additional computation that is not valid on the GPU. Okay. So this gives you a kind of summary table of all the possibilities that you can create with OpenMP and OpenACC.

Remember that we don't say OpenMP or OpenACC anywhere here; we say multi-threaded on the CPU and offloading to the GPU, because all of these combinations can be implemented using either OpenMP or OpenACC. Okay, so you have many possible implementations to generate and to test on your code.
Okay, so let's go into the details. What we have not yet defined in detail is how these parallelization strategies actually behave, so we need to reinforce and learn exactly how they work. Let's begin with the parallelization strategy parallel loop. This one is trivial: if a parallel forall is found, for instance in this code, this is the code generated with Parallelware Trainer, using OpenMP or OpenACC, for CPU or GPU.

What you can see here is that in each iteration you compute different values that are stored in different memory locations, in different array elements; so you have a forall pattern, and you can parallelize it by just defining the parallel region. The first thing you have to remember, for all the patterns, when building the implementation, is where the parallel region begins and ends. If you are focusing on the analysis of loops, typically the parallel region begins right before the loop header and ends right after the end of the loop.

You need to say, implicitly or explicitly, whether each variable will be shared among the threads, will be private to each thread, or will be reduced — all the private local values of the threads combined into one single value at the end. So you need to specify every single variable. In the OpenMP implementation of this loop, in this version, we force as a best practice for learning the clause default(none). What default(none) means in OpenMP and OpenACC is that the compiler will fail to compile your code if you don't specify every variable used here in either a shared, a private or a reduction clause. Okay, there are some additional variants, like firstprivate; we won't go into that detail. What you have to remember is that, for all the variables that are used, you are forcing yourself to specify them explicitly here.

So in this case, all the variables that are read-only are shared, but the array that is the output can also be shared among the threads. Shared means that all the threads can access it concurrently, but every iteration accesses a different element; you never have two different threads accessing the same element at the same time. That situation cannot occur, because the parallel forall pattern guarantees that it cannot occur. If the analysis in terms of patterns is correct, it is safe to just create a parallel region.
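As a reference, here is a hedged sketch of what this parallel-loop strategy can look like for a forall loop like the one sketched earlier (the code generated by the tool for the slide's example may differ in details): a parallel region around the loop, default(none) forcing explicit data scoping for every variable, the output array shared, and the work sharing controlled with a schedule clause.

    /* Parallel-loop strategy for a forall: no extra synchronization needed. */
    void scale_parallel(int n, const double *in, double *out, double factor) {
        #pragma omp parallel default(none) shared(n, in, out, factor)
        {
            #pragma omp for schedule(auto)
            for (int i = 0; i < n; i++) {
                out[i] = factor * in[i];   /* each thread writes its own elements */
            }
        }
    }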
You can schedule the iterations of the loop in whatever order is best for your code, run it in parallel, and the code will always be correct. Okay, this is the power of the parallel forall pattern. Additionally, here, again for the sake of promoting and helping learning, we add the schedule clause to the for work-sharing construct. What this means is that you have several options to map the iterations of the loop, in different orders, onto the threads. You all raised your hands when I asked whether you had written MPI code, so think of it this way: essentially you can do a block distribution of the iterations among the threads, a cyclic distribution of the iterations among the threads — cyclic by one, cyclic by two — or a distribution of blocks of iterations among the threads. The same concepts that apply to data distributions in MPI implementations are applied here to specify the way the iterations of the loop are mapped, are assigned, to the threads within the current parallel region.

Okay, so we just put auto here; this delegates choosing the right schedule to the compiler, but you can edit it and write static, static,1, dynamic or runtime — four or five options that you can easily change to run different experiments with your code. Okay.

So, in terms of concepts, with a parallel loop you specify where the parallel region begins and ends; you specify, for all the variables, whether they are shared, private or reduction — in particular, the output array of the parallel loop can be shared, because there is a guarantee that no race condition will appear; and you also play with the work-sharing construct and modify its default behavior using the schedule clause.
Great, so let's move on to the first type of synchronization that we need to add to this fully parallel loop, to this parallel forall: the parallel loop with built-in reduction. Again, here we have a code that we already know, the computation of pi, the same example we used before the break. Here the reduction variable is sum, which collects the final result of summing all the values produced while evaluating the iterations.

This expression is evaluated for different values of i, and i is the loop index. So in general we have a reduction, a scalar reduction: each iteration produces a different value, and at the end of the loop we want to reduce them all — with a sum, in this case — to one single final value. Okay, so again the loop is characterized by a parallel scalar reduction. How do we translate this into parallel code?

Again, definition of the parallel region: it begins and ends at the limits of the loop, right before and right after the loop. Again, with default(none) we force the specification of all the variables that are used in the loop. Note that if you declare some variables inside the loop, there is no need to specify them in the clauses of the parallel region, because a variable declared inside is not visible outside.

It is automatically local to the thread that has been assigned that iteration. Okay, so even the way we declare the variables in our code can help us make the OpenMP or OpenACC implementation simpler: if we declare x outside of the loop, we need to add x here as a private variable; if we declare it inside, the code is still correct.
So, as a rule of thumb — there are only a few variables here, but you will find many in big codes — all the variables that are not written, that are only read, need to be shared. Okay. Only those variables that are written need to be managed somehow with additional synchronization. In this case, the only variable that is written, apart from x, which is declared within the loop body, is the variable sum, and the pattern is telling us that sum is the reduction variable.

So we know that, in order to parallelize it, the additional synchronization we need to add is the clause reduction(+:sum). This instructs the compiler to generate the synchronization needed to perform the reduction of these values. Okay. So: shared variables, again the work sharing with for and schedule(auto), and with that we have covered the reduction and the variables that are written — in particular the variable that is the output of the pattern; we know exactly what we need to do with it.
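Putting those pieces together, here is a hedged sketch of the built-in-reduction strategy for the pi loop (a minimal version along the lines described here, not a copy of the generated code): default(none) forces explicit scoping, sum is handled by the reduction clause, and x needs no clause because it is declared inside the loop body.

    /* Built-in reduction strategy: the compiler synchronizes the partial sums. */
    double compute_pi_reduction(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        #pragma omp parallel default(none) shared(n, dx) reduction(+: sum)
        {
            #pragma omp for schedule(auto)
            for (long i = 0; i < n; i++) {
                double x = (i + 0.5) * dx;
                sum += 4.0 / (1.0 + x * x);
            }
        }
        return sum * dx;
    }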
In modern C you can declare variables inside the loop like this; in Fortran you have many more restrictions. Do you typically write code in Fortran or in C? In C, C99 allows you to do it, so with anything from C99 onwards you should be able to do this with no problem. In Fortran, depending on the Fortran flavor that you use, you may be allowed to do this or you may be forced to declare all the variables at the beginning of the procedure, in which case you are forced to manage the data scoping of those variables in the clauses of the OpenMP pragma.

What we do not control here is exactly how, and in which order, this set of partial values is reduced. That is up to the compiler, which will perform the reduction operations in the order that is optimal for a given platform. Okay, so we delegate to the compiler the order in which all the values produced by the threads are actually combined, and we don't worry about it: the compiler will do a good job at that and will guarantee that the result is correct.
Let me say this in a different way. You have to guarantee that when the threads access the shared variable s they do it in an exclusive way, so that while one thread is reading the value, adding its contribution and storing the result back into the same shared variable, no other thread can interrupt that process. If you don't, what can happen is that you get an incorrect final value of the sum, because mutual exclusion — atomicity — has not been guaranteed.

Okay, so this is how the atomic strategy works. All the threads access the shared variable, so during the execution of the parallel loop, concurrently, you can have thread zero reading the value of s, adding a value and storing the result back in the same location; at the same time thread one can be doing the same, and thread two as well. So what do you need to do?

What you need to do is protect this plus-equals operation with atomic. Atomic means that whenever thread zero is doing this plus-equals operation, the rest of the threads wait until thread zero finishes it; when it finishes, it keeps on working, and then another thread is granted access to the mutual exclusion section, so that it, again without interference from other threads, computes its plus-equals operation. Okay, so this guarantees the atomicity of the plus-equals operation. If we don't guarantee this, the result in general will be incorrect.
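A sketch of the atomic-protection strategy applied to the same pi loop (again a minimal, hedged version rather than the exact generated code): the reduction variable sum stays shared, and every update of it is protected with an atomic construct.

    /* Atomic-protection strategy: one atomic operation per iteration. */
    double compute_pi_atomic(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        #pragma omp parallel default(none) shared(n, dx, sum)
        {
            #pragma omp for schedule(auto)
            for (long i = 0; i < n; i++) {
                double x = (i + 0.5) * dx;
                #pragma omp atomic update
                sum += 4.0 / (1.0 + x * x);
            }
        }
        return sum * dx;
    }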
So in parallel, what we have is all the threads, in the different iterations, doing different plus-equals operations on the same shared variable — thousands of them — so we need to execute thousands of atomic instructions to protect the update. Intuitively, you can see that there is a lot of additional synchronization being added with this strategy, because every single plus-equals operation needs to be atomically protected. Okay, and the number of atomic operations that you issue is proportional to the problem size.

If you have one thousand iterations, one thousand atomics; twenty million iterations, twenty million atomics. So the parallelization overhead, the amount of synchronization, grows as you raise the problem size — and when you raise the problem size it is usually because you want to go to the GPU — so you will need to find a balance, a trade-off, when you use this strategy on the CPU or on the GPU. Okay? Is the concept of mutual exclusion clear? Okay. So, from the point of view of the implementation, we repeat the same series of things.

Definition of the parallel region enclosing the loop that we have analyzed in terms of code patterns; shared variables — all the read-only variables are shared; again the work sharing — all the loop iterations are distributed among the threads according to a schedule, auto to delegate it to the compiler, or we can specify static, static,1, dynamic or runtime.

Okay, so you can see that there are many implementation steps that are common to all the patterns: particularly the definition of the parallel region, the rules to determine which variables are shared, the work sharing, and also how to add the synchronization to protect those parts of the code that are sensitive in the parallel execution — and we have all of that information by using these patterns. Okay.
Now we have a slightly different scenario. We still have the shared memory and the shared variable, so all the threads can access the shared variable at any time, but now we call this explicit privatization, because what we do is create a copy of the shared variable in each of the threads. So each thread has its private data, its private copy of the same shared data: if the shared data is a scalar, each thread has a private scalar.

Okay, so this is the key difference between them, because we are removing all the atomic protection that was needed during the computation: the threads work completely independently from the rest of the threads, with no synchronization. Where is the parallelization overhead? In the amount of memory. Where is the parallelization overhead of the atomic-protection strategy? In the atomic operations — and there you don't incur any additional memory.

So, in terms of parallelization overhead, we always need to find a trade-off between additional synchronization, with atomics or mutual exclusion, and additional memory, to create private copies of variables that decouple the parallel execution of the different threads. Having these two things in mind, and finding the right balance for your code, is how you can really create a very efficient parallel implementation; and privatization is one of the most effective ways to implement scalable parallel code in real applications.

So, during the computation, no atomic protection is needed. But of course, at the end each thread has its own private copy, a partial sum of the final result, so we need to do something else that was not needed with atomic protection. What do we need to do? Each thread contributes its private, local result to the shared copy in shared memory: the final private partial result is summed into the shared memory.
Here we do need atomic protection, but in this case we only need as many atomic operations as the number of threads that we have, in contrast to the problem size. You can have one billion iterations and four threads: four atomics. In the other approach, with one billion iterations you have one billion atomics for those same four threads. Okay, so it is always a trade-off between the amount of memory used to reduce or remove synchronization, and the minimum amount of synchronization that you need to guarantee correctness.

Okay, so with explicit privatization, by explicitly creating private copies of the original variable for each thread, you decouple the execution; you do the computation very fast and very efficiently, and you only need to synchronize at the very end. And here is what you can do with Parallelware Trainer — this is again the same pi example. Before going to the case of arrays, let us start with the simple case of a scalar. We have the same pi example that you already know, so what has happened here?
The tool has created the parallel region, has created the work sharing, and now, instead of adding atomic protection or using the reduction clause, it has created a preamble before the loop that creates a private copy for each thread. In the loop, the uses of the original shared variable have been replaced with uses of the thread-local variable, so here all the threads work independently, with no synchronization at all.

Once they finish, each thread, using atomic protection, updates the final shared variable with its private result, in a postamble. So explicit privatization takes the original loop and creates three stages in the parallel implementation: the same loop, with the original uses of the reduction variable replaced by the private copy; a preamble to declare the private copy and initialize it; and a postamble to reduce — to compute the final result from the thread-local partial results computed by each thread. This is what you see here: preamble, main body, and postamble. Okay.
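A hedged sketch of that three-stage structure for the pi loop (illustrative names; the Trainer's generated code may differ in details): the preamble creates a thread-local copy, the main loop accumulates into it with no synchronization, and the postamble adds each thread's partial result into the shared variable under atomic protection — one atomic per thread.

    /* Explicit privatization strategy: preamble, main body, postamble. */
    double compute_pi_privatized(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        #pragma omp parallel default(none) shared(n, dx, sum)
        {
            double sum_private = 0.0;               /* preamble: private copy */
            #pragma omp for schedule(auto)
            for (long i = 0; i < n; i++) {
                double x = (i + 0.5) * dx;
                sum_private += 4.0 / (1.0 + x * x); /* no synchronization here */
            }
            #pragma omp atomic update
            sum += sum_private;                     /* postamble: one atomic per thread */
        }
        return sum * dx;
    }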
This can be applied to scalars, but it can also be applied to arrays, and this is probably what you will be doing if you complete the hands-on practical — quite complicated to do by hand, but much simpler with Parallelware Trainer. Again the same structure: the loop, the parallel region, the work sharing with the schedule clause; the uses of the global variable y replaced by uses of the private copy; and a preamble that allocates the private copy — but now it is not just declaring a scalar.

You need to allocate the memory and you need to initialize each of the elements of the array, so the tool generates this code for you. The preamble is about creating private copies with all the elements that the original variable has: if it is a scalar, it is trivial; if it is an array, it has as many elements as the original. Okay, so from here on each thread works on its private copy without interacting with any other thread.

And finally, we use omp critical, which is another way in OpenMP to guarantee atomicity, mutual exclusion. What this means is that when one thread enters this critical section, only that thread is updating the original shared variable with the values it computed locally; while it does so, the rest of the threads are waiting. When it finishes, another thread is granted access, enters and computes this part while the remaining threads wait, and so on, until all the threads have been granted access, sequentially, to compute this part.
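A hedged sketch of explicit privatization for an array (sparse) reduction, following that preamble / main body / postamble structure (all names and sizes are illustrative): each thread allocates and zero-initializes a private copy of the whole output array, accumulates into it freely, and then merges it into the shared array inside a critical section, one thread at a time.

    #include <stdlib.h>

    /* Array privatization: private copy per thread, merged under omp critical. */
    void sparse_reduction_privatized(int n, int m, const int *neighbor,
                                     const double *contrib, double *force) {
        #pragma omp parallel default(none) shared(n, m, neighbor, contrib, force)
        {
            /* preamble: allocate and zero-initialize the private copy (m elements) */
            double *force_private = calloc((size_t)m, sizeof(double));
            #pragma omp for schedule(auto)
            for (int i = 0; i < n; i++) {
                force_private[neighbor[i]] += contrib[i];   /* no synchronization */
            }
            /* postamble: merge the private copy into the shared array */
            #pragma omp critical
            for (int j = 0; j < m; j++) {
                force[j] += force_private[j];
            }
            free(force_private);
        }
    }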
So in the end what you have is the same result as in the scalar case: the original variable holding the global result of the reduction. Okay. This is what you see, for instance, in real applications of the CORAL benchmarks; there they make further refinements, for example by reducing the amount of memory used here, but essentially the concepts, the best-practice recommendations, are what you can find in our tool today. This comes from the work that we published on this sparse reduction.

Okay, and remember that for a sparse reduction you usually don't have built-in support in the standards. There are some differences between Fortran and C, but in general you should consider that there is no support for making reductions on arrays in OpenMP and OpenACC, apart from some exceptions that do exist in the standards. Okay, so that is the reason why, for arrays, for a sparse reduction, you need to use atomic protection, both on the multi-threaded CPU and on the GPU.
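A sketch of that atomic strategy for a sparse reduction on the multi-threaded CPU (illustrative names again; the same idea applies when offloading with target directives): every indirect update is protected, so correctness holds even when neighbor[] maps two iterations to the same element.

    /* Atomic strategy for a sparse reduction: safe even with colliding indices. */
    void sparse_reduction_atomic(int n, const int *neighbor,
                                 const double *contrib, double *force) {
        #pragma omp parallel default(none) shared(n, neighbor, contrib, force)
        {
            #pragma omp for schedule(auto)
            for (int i = 0; i < n; i++) {
                #pragma omp atomic update
                force[neighbor[i]] += contrib[i];
            }
        }
    }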
On the multi-threaded CPU you can also use the explicit privatization strategy, which cannot be applied to the GPU because, as we said, it would allocate private copies of the whole array for thousands of threads, exploding the memory usage; you would easily run out of memory and the application might crash. Okay, that is why we don't support it and don't recommend using this strategy for the GPU.

About each of these strategies in general: remember that the only one that has no synchronization overhead is the parallel loop. If it is applicable, it is great, because you don't need any synchronization at all — the analysis in terms of patterns guarantees that each iteration writes to a different memory location, so there is no need to worry about potential race conditions or incorrect behavior.

The built-in reduction is great if you have support for it in OpenMP or OpenACC. It is similar to MPI: in MPI you have MPI_Reduce, which is an implementation of a reduction operation across the MPI ranks. Reduction operations are so common that all the parallel programming tools have some built-in support for them. The question is whether the reduction operation you need is supported by the tool you are using; if it is supported, great.
You just use the built-in support and everything will work just fine. If it is not, then you need to use alternative implementations. Now, the most recent versions of OpenMP let you provide user-defined reduction operations, but this came in OpenMP 5, I think, and here we are considering up to OpenMP 4.5; up to 4.5 you didn't have that feature, although it is something that is coming in the upcoming releases of compilers.

But anyway, if you decide not to use it, or you don't have those features available, you still have two other strategies that you can use. The atomic one is very easy to understand: you don't need to change the code; you just execute the code fully in parallel and, for those operations that are reduction operations, you add synchronization to guarantee atomic protection.

Explicit privatization has drawbacks in terms of memory — you are using more memory, potentially much more for arrays — but it allows you to remove synchronization and reduce the synchronization overhead to a number of atomic operations that is proportional to the number of threads, not to the problem size. So you can scale to very large problem sizes for your science, and your parallel implementation will still scale in performance.
As far as we understand, the built-in support handled by the compiler behaves more or less like this explicit privatization, because when you measure performance they are more or less similar. We do expect the compilers to be able to produce more optimized implementations of the final reduction operation than what we are generating right now in Parallelware Trainer: there are ways to do the reduction using trees, using schemes that, instead of doing the final reduction sequentially, one thread after the other, do it in parallel in several stages.

So compilers are supposed to apply such optimizations for the target platform that you have. Of course, you can also write an optimized implementation of explicit privatization yourself, by optimizing the amount of memory that you allocate — instead of allocating a full copy of the array you can allocate less memory, as long as it covers the elements actually referenced in your code — and you can reduce the synchronization overhead by implementing some kind of tree reduction in the final part. But that would make learning these concepts a bit more complicated, so we decided not to implement it in the version we have available so far.

Okay, so this is more or less everything you have. In the practicals you will be playing with examples of all three patterns, so when you modify a loop you can generate a different version of your code, and for each loop you can generate all of these versions. In the practical example you have 12 loops, and for each loop you can apply two or three strategies.
So you could generate up to 40 different parallel versions of your code just by combining different strategies applied to different loops across the whole code. Doing that by hand is very, very time consuming, so the Trainer helps a lot, both in learning and in producing code: it handles this combinatorial work and supports the process of implementing all of these variants of the parallel code. Okay, one thing we will not explore here, but which you do have in the documentation:

it is something we have already added in the Parallelware Trainer version that you have installed on Cori. Very briefly, we have added support for tasking. Tasking is another paradigm that is attracting a lot of interest in some scientific domains, so it is another possibility that you have there: again more options, for the different strategies and the different patterns, to generate more versions of the code. This is more or less what it generates right now: the same pi loop can be implemented in parallel, but instead of using for to do the work sharing, the work sharing is done by creating tasks that are finally synchronized with a taskwait. This is the tasking support introduced back in OpenMP 3.0, and you have it available in the Trainer; we have also added support for the taskloop pragma from OpenMP 4.5, so with that syntax you also get a tasking implementation of the same code. Okay, this is just for your reference.
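For completeness, here is a hedged sketch of a tasking version of the same pi loop using the OpenMP 4.5 taskloop construct (one possible form; the Trainer's generated code may differ — for instance, reduction clauses on taskloop arrived later in the standard, so this sketch falls back on atomic updates): a single thread creates the tasks, and the taskloop waits for them through its implicit taskgroup.

    /* Tasking strategy (taskloop) for the same scalar reduction. */
    double compute_pi_taskloop(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        #pragma omp parallel default(none) shared(n, dx, sum)
        #pragma omp single
        {
            #pragma omp taskloop shared(n, dx, sum)
            for (long i = 0; i < n; i++) {
                double x = (i + 0.5) * dx;
                #pragma omp atomic update
                sum += 4.0 / (1.0 + x * x);
            }
        }
        return sum * dx;
    }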
Okay, any questions? So, the concepts: the patterns; the strategies that are applicable to each pattern, which are the different ways of implementing it; and then you can implement each of these strategies using a choice of OpenMP or OpenACC, offloading or multi-threading or tasking, GPU or CPU. So you have a lot of possibilities to generate and to play with for your codes.