From YouTube: 1. Introduction to OpenMP/OpenACC and GPUs, & 2. Accelerating with OpenMP/OpenACC
Description
Parts 1 and 2 from the Parallelware Trainer Tool workshop at NERSC on June 6, 2019. Slides are available at https://www.nersc.gov/users/training/events/parallelware-tool-workshop-june-6-2019/.
We are really pleased to be here presenting what we are doing from Spain at Appentra: essentially, creating a new set of tools based on the Parallelware technology, which is unique to Appentra and is why this startup was incorporated and funded. What you will see today is how to use this tool to learn best practices for parallel programming with OpenMP and OpenACC for GPU programming, based on a different approach: understanding your code from the point of view of patterns.
We will try to cover all of this approach so that you can really understand how the tool works, and why the tool is able to do what it does right now. We will also try to set expectations on your side about what the tool can do right now, the new features we are working on during this year, and our plans for the next releases.
Okay, so, as I said, we expect you to learn a different approach to parallel programming. The usual approach is: look at your code, look at your instructions and the dependencies between them, insert directives, test whether everything works just fine, and, if the code doesn't work, find out where the problem is and fix it by modifying clauses and pragmas. This process is time-consuming; it is a real problem. It takes a lot of debugging effort, so we want to avoid that effort in that part of the development workflow by understanding from the very beginning how your code behaves, and how this behavior can be used to understand how to convert your code from sequential into a parallel version that is correct and that is performant. For that purpose, what we expect you to learn today is how to decompose real codes into these parallel patterns.
For this purpose we have prepared a simplified version of the well-known LULESH benchmark, one of the CORAL benchmarks, from the hydrodynamics scientific field. Using this version of the code, you will see that you can address complexity in the code, try to parallelize it with the trainer, and learn how to parallelize more or less real codes. Let's see how we can understand the limitations and the current benefits of the tool.
And this is not only OpenMP and OpenACC for GPU: you will learn concepts that are properties of your algorithm and of your code, independently of the hardware platform that you are using. This is the power of this pattern-based approach, and what we have learned we have put into the tool. We have collaborated with centers such as Oak Ridge National Lab, NERSC, the Barcelona Supercomputing Center, and Jülich in Germany, and we have come to an agreement with all of these people on what the best practices for parallel programming using OpenMP and OpenACC are.
In short: what is good for you when you want to implement a code in parallel; what kinds of implementations you can expect to have good performance; what kinds of implementations you cannot expect to have good performance, and why. These best practices of parallel programming are something that we will learn and discuss during the training, now and in the afternoon. And of course you can use the patterns approach both for CPU and for GPU, using the Parallelware Trainer tool.
You will see that you can set up and choose between different CPU and GPU platforms, the OpenMP and OpenACC standards, and even different programming paradigms like multithreading, offloading, and tasking. We will see that in the demonstration of the tool — and, of course, do not hesitate to interrupt me at any time; I will be really pleased to answer your questions. So, the agenda for today: before the break, I will try to introduce the minimum set of concepts that you need to understand to go from the CPU to the GPU.
We will not go into details of the hardware, and we will not go into details of the semantics of the pragmas and the clauses. You will learn that by practicing and by studying what the tool is doing: you will see the tool doing some work, you will analyze the output of the tool, and you will be able to learn all of those issues. Here we only want to introduce the key concepts that distinguish CPU programming from GPU programming, so that you have the minimum set of knowledge you need to program the GPU well. Then we will show you how to write code using OpenMP and OpenACC, and here I will do a demonstration of the tool — a walkthrough of the graphical user interface — so you can get a first feeling of what the tool looks like and how to use it. Later, after the break, we will focus on the key theoretical concepts of the patterns, again in a very lightweight way.
These are the key concepts you need to understand; with the help of the tool, you will be able to recognize these patterns and apply different policies and strategies for each pattern. Then you will be able to do a practical by yourself, repeating with the pi code what I have done in my demonstration, on your own, following the worksheet that Helen and I will share with you during the morning. You will follow a set of steps to get used to the tool, generate multiple versions of pi, measure the performance of these different versions on Cori using the CPU and using the GPU, and, with all that knowledge, compare and pick which you think is the best implementation — the one that provides the best performance. We will use the same example that has been used here.
The idea is that if you have a code that you are interested in, you can try to understand how it decomposes in terms of patterns and see if you can approach the parallelization of your code with our tool; then we will sit with you to see how you can take a first pass at using the tool with your code. For those who don't have a code, we have prepared a simple example — a simplification of the LULESH CORAL benchmark — again with a worksheet with a detailed set of steps.
With it you can cover everything you need to create a GPU-enabled OpenACC version and an OpenMP version that runs faster on the GPUs of Cori. If you have never heard about this code, and you don't know what hydrodynamics is about because it's not your field, you will not need that. The important thing you need to learn is only about the code: how you code your algorithm, what the properties of this code are, and how this puts a set of limitations on how you can parallelize your code — how the code decomposes.
Who of you here has already programmed a GPU? Has someone already used OpenMP? No OpenMP, no OpenACC, no parallelism — okay, great. Then it's a good approach to have this content here, because we begin from the very basics: we don't assume any previous knowledge of parallel programming, GPU programming, or multicore programming. So feel free to ask questions, because this course is for you; we begin from scratch. Okay, so let's go on to lecture number one: introduction to OpenMP, OpenACC, and GPUs.
Okay, GPUs are kind of a trending topic. Everyone wants to program the GPUs, because a colleague is also working on GPUs and is getting amazing speedups on their code. GPUs are also supposed to be kind of the future for getting good performance — peak performance — on pre-exascale and exascale supercomputers, the machines that are being manufactured now and that will come in the next generation. Why? Because the CPU usually takes a lot of power, while the GPU consumes less power.
On the exascale roadmaps that are building the next generation of the most powerful supercomputers in the world, mostly all of them use accelerators. I think this is from last November: 9 out of the top 10 supercomputers use accelerators, and 5 of them are using GPUs; the others are using different types of accelerator. So if you want to port your code to one of these machines to do big science, then somehow you will need to take advantage of the GPUs to get resources allocated for your code. And apart from that, GPUs are ubiquitous — even my laptop has a GPU,
so I can install the full software stack and use OpenACC or other programming languages to accelerate my code on my laptop. It's something that you need to learn, because at some point you will need to port your codes for peak performance, to do good science and big science.
So what is the GPU? The GPU is hardware specialized and designed to do massive floating-point operations. While you usually see the host, or CPU, as a set of floating-point units that is more or less limited in number, the GPU can be seen as thousands of floating-point units, independent of each other, that you can use at the same time.
Doing many things at the same time in different pieces of the hardware provides a lot of computational power that you can use to accelerate your codes. Okay, so what you can imagine is that you need to somehow specify how your code can be executed as vector instructions. Just remember this idea: if you execute one instruction on a vector unit, you are using only one of the lanes of the vector unit.
So with four lanes and one instruction, you are not using three of the four floating-point units. Now imagine that you have thousands of them: if you run sequentially, you cannot expect to get good peak performance out of hardware designed like the GPU. So don't run sequential code on the GPU, because it will typically be much slower than running simple multithreaded code on the CPU. And apart from that, we will not go into those details.
I will only introduce some of the concepts you need to understand why the GPU has a complex memory design. The reason why it is so powerful is that all of these thousands of floating-point units can access memory units that are dedicated to groups of these floating-point units — but this imposes limitations on the communication between the threads, so not all threads can communicate with all the remaining threads. This is a characteristic that differs from the CPU.
Whenever you create an OpenMP multithreaded program, all the threads you create can communicate with the rest of the threads. This doesn't happen on the GPU; it is an implication of the complex memory design and the complex hierarchy that you have on the GPU. So just remember these things, and you will see how we introduce a few concepts so that you can use the GPU and its memory design to accelerate your code.
Despite the complexity of the hardware, we will not need to go deep into the hardware details to accelerate codes using OpenMP and OpenACC. Okay, so, in contrast to the GPU model, in CPU multithreading we usually have one host and one memory. You start your sequential code there; it uses the memory to do the computations and provide you the result. If you enable the code with multithreading using OpenMP, you have multiple threads running at the same time, all accessing the same memory, to provide you the result.
When you want to use the GPU, you need to start your application — your code — on the CPU, either single-threaded or multithreaded, but it needs to start on the CPU. At some point you specify a region of the code — in CUDA terminology, for instance, this is called a kernel — that you offload to the GPU. Offloading means that a separate binary is created for the hardware of the GPU, and that binary is transferred to the device.
You have a host and a device; each one has its own memory, and you need to transfer information — code and data — from the CPU to the GPU (host to device) and back from the GPU to the CPU (device to host). This is essentially what you will specify when you add OpenMP and OpenACC capabilities: you will say, "I want this piece of code to be offloaded to the device," and you will need to specify what data needs to be transferred.
So we have just said more or less all of these things: the GPU execution model is a host-driven execution model. Remember that your code will start on the CPU; only the part that you specify to be offloaded to the GPU will be executed on the GPU, and afterwards the result will be transferred back to the CPU. It is the CPU code that will provide you with the results of your science.
Okay, sequential code runs on a conventional processor — on the CPU of your machine — so the computationally intensive parts of your code need to be transferred to and accelerated on the GPU. To maximize performance on the GPU, what you need is to identify those parts of your code that consume most of the execution time. This is what is typically known as hotspots, found when you do a profiling of your application. How many of you have done a profiling of an application?
If you start from scratch with a code that you don't know, it is mandatory that you start with a profiling, so that in the first steps you focus only on those parts that consume most of the time. One hour or one day of development time invested there will provide you a bigger return on investment than focusing on a part of the code that is only 5% of the execution time. That's why combining this with profiling is important.
So once you have these parts of the code identified, what you need to keep in mind is more or less these three guidelines. First, transfer the data onto the device and keep it there. What that means is that transferring data from the memory of the CPU to the memory of the GPU is the most costly part of your GPU-accelerated code; you need to minimize it. If you can transfer data, leave it there and use it; don't transfer it back.
Okay, so remember: minimize data transfers, and leave data on the device if possible. Second, give the device enough work to do. Remember that you have thousands of floating-point units; if you use only 10% of them, you are using the GPU, but you are not using 90% of the floating-point units that are available for your code. What this means in general is that you will need to run big problem sizes to take advantage of the computational power of the GPU.
So let's go on and see why you would use OpenMP or OpenACC, in contrast to many other programming tools that you can find in the ecosystem. First of all, GPUs have a reputation for being very difficult to program — and they are difficult to program, indeed. If you want to program the GPU to achieve peak performance using CUDA or OpenCL, you really have to rewrite your code completely, and probably rewrite your data structures.
OpenMP and OpenACC are here to help us bridge the gap. Essentially, what they provide is a set of directives — a simple application programming interface — that we can use to GPU-enable parts of our code incrementally, without rewriting the whole code as we have to do with lower-level approaches like CUDA or OpenCL. OpenMP and OpenACC are designed with productivity in mind.
Creating a parallel version of your code is very time-consuming; it's complex, and you need a lot of expertise to do it. So whenever you create your parallel version, the question that you have is: okay, I created it for system one, but I need to do bigger science and run it on system two. Can I port the code to system two so that it runs and provides the correct results? So OpenACC and OpenMP have also been designed with portability in mind.
What that means is that, as you do with your sequential code, you just recompile your code with the appropriate flags on a different system, and the code should run. They also offer good readability. Remember, if you have used MPI — have all of you used MPI? — starting from sequential code, if you want to create an MPI version, you have to rewrite your code completely.
Maybe you can recognize some of the loops, but you have to add MPI_Init and MPI_Finalize, all the data transfers, all the communications, so the parallel code and the sequential code hardly resemble one another. OpenMP and OpenACC are designed to avoid that: you keep your sequential code and you add OpenMP and OpenACC capabilities through the pragmas, but you still have one piece of code that you need to maintain and improve — not two or three separate codes, each tailored or specifically designed for one parallel programming model.
Another good thing about OpenMP and OpenACC is that they abstract away many details of the hardware. If you code in MPI, for instance, you need to program every single communication between every single pair (or set) of processes. On the GPU, if you write code at a low level using a library like CUDA or OpenCL, you need to control how the threads are created and scheduled, how they communicate, and what parts of the memory they access.
What different levels of the memory hierarchy are they using — the global memory, the shared memory, the scratchpad, the cache? You have very complex hardware on the GPU, so you need to be aware of that when you program at the low level. The good thing about OpenMP and OpenACC, as we will see today, is that many of those details are abstracted away for you. You don't need to care about them — just notice that some of them exist, and that OpenMP and OpenACC provide you some ways to control how to use some of these hardware features.
Okay, so an implication of all of this is that it minimizes the need for code refactoring of sequential code. When you wrote your first MPI version, you probably rewrote most of your application. OpenMP and OpenACC are designed to avoid that: you just add some pragmas and enable some flags in the compiler, which converts those pragmas into parallel code. And if you don't want to use them, you disable the flag in the compiler and you get the original sequential code — no need to maintain different versions of the code.
OpenMP and OpenACC support C, C++, and Fortran. Now, in order to correctly set the expectations that you may have of the Parallelware Trainer tool: at this moment we are supporting the C programming language, because of some technical implications, but we are working — and expect during this year — to have support for C++, especially for C-like code within C++ files, and we are also working on Fortran.
We have first results that we will present at ISC in Frankfurt, in Germany, in two weeks, and we hope to have some of this Fortran support by the end of the year, by Supercomputing. But this is something where I want to set correct expectations: we are working on it very hard, but let's see how we can do it. So all the examples that you will use today are written in the C programming language. Are all of you familiar with the C programming language, more or less? Yeah.
Yes — all the method about the decomposition of the code into patterns applies to any code in any programming language; it is independent of the programming language that you are using. What is tied to C is the current version of Parallelware Trainer, 1.0, that we have installed on Cori and will be using today. But we will improve the product so that we can support C++ and Fortran in the tool.
We can evaluate that: as long as you write some of your functions in a C-like style within the C++ code, you can analyze those files with Parallelware Trainer. We have done it in the past, so we can do it, but it is very dependent on the features of the C++ programming language that you are using in your code; we would need to evaluate that.
Okay, so, finally: what are OpenMP and OpenACC? They are one more method to use the GPU, and they are designed as extensions to the programming language — that is, the pragmas and directives extend C, C++, or Fortran. So if you have your code, you add the syntax of the directives and the pragmas to add OpenMP and OpenACC capabilities to your code. It's an extension to the language; it is not part of the language itself.
They use compiler directives. What that means is that you have the support of a compiler: whenever you specify the correct pragmas, directives, and clauses, it is the compiler that does the hard work for you. If you code in MPI, you need to decide when the ranks are spawned, when they communicate, and when they finish; it is you who has to decide and make that implementation.
In OpenMP and OpenACC, we specify where a parallel region begins and where a parallel region ends, and with the support of the compiler it will generate the binary code to create the threads — using POSIX threads, for example — and to destroy the threads at the end of the parallel region. You don't need to worry about all the complexity of using the underlying threading library available in the operating system. Okay, so that is what compiler directives, and having the support of a compiler, mean. In OpenMP and OpenACC, both of them use a host-driven programming model.
Remember that it is the CPU that starts the execution and controls the execution; we are only offloading the most computationally intensive parts to the GPU, and the CPU is waiting for the results coming from the GPU. It is a host-driven execution model. And both of them use the concept of a thread or a task, so, simplifying a lot,
you can consider the abstract concept of a task, with several implementations — threads, processes — where tasks collaborate to solve one single problem in parallel, to finish earlier, faster, and to provide you the same numerical result. And again, they are focused on portability of your code.
You want your code to be executed on Cori, but you also want your code to be executed on the next machine that will come to NERSC, and you also want your code to be executed on your laptop or on another supercomputer that you need to use for the purposes of your science. That's what portability means. Okay, so, to finish this part: benefits and limitations of OpenMP and OpenACC. Benefits: OpenMP and OpenACC are simple to use, as you will see, and they are portable across different systems.
You recompile your code as you do with your sequential code, and they are hardware-independent: when you have an OpenMP code, it will run on any multithreaded operating system, as long as you have a compiler supporting the corresponding OpenMP standard — and the same holds for OpenACC. Limitations: as we said before, we have the advantage of making parallel programming more productive — faster, a better use of our time — but this comes at a cost. The cost is that you cannot control everything in your program.
You can only control those features that are exposed in the application programming interface of OpenMP and OpenACC. If you want to do something different, then you need to go and use different tools like CUDA or OpenCL, which are designed to allow you to control everything that you can do on the GPU — but for that reason they are much more complicated to use.
So once you learn the basic knowledge — once you know how to use the standards and how to program for the CPU and GPU using OpenMP and OpenACC — you can begin to think about how to make your parallel implementation better, so that you can incrementally increase the performance of your application, and at some point your GPU code will be faster than the CPU code. Okay, so, in order to optimize performance, remember: on the GPU you need to reduce data transfers. That's the number one priority — avoid data transfers.
Whenever you can, allocate memory on the GPU, use it there for the computation, and avoid data transfers back and forth between the two memory systems of the host and the device. And again, about peak performance: you will see papers or articles or announcements about the computational power of the GPU — "we made the application run 200 times faster," "we made our code 70 times faster than before." How can you achieve that?
You can achieve that peak performance usually by doing very sophisticated programming of the GPU. For an average application, a realistic performance of three, five, or ten times faster is something that you can consider good performance on the GPU, without going into the burden of all the details of the low-level programming interfaces of CUDA or OpenCL. It depends on the application: depending on its characteristics, on the patterns of your application, you can even obtain higher speedups, but it pretty much depends on your application.
Before doing the demonstration of the tool, let's do a very fast review of the pipeline — the steps you have to follow in order to parallelize your code in general, and in particular to parallelize your code to execute it on a GPU. You begin — remember — by profiling your code. If you have never done it, it could be good, as you follow one of these courses, to do a simple profiling, to double-check and be sure that the functions you are working on are those that really consume most of the execution time.
That will give you the biggest return on investment for the effort of going to the GPU. So: first, identify the hotspots. Second — probably the most difficult part of parallelizing for any platform — analyze your code to discover parallelism. You need to understand your code, and, as we said, here is where the patterns approach that we will be using provides a lot of value; it is completely different from other approaches that you can see in similar courses or tutorials.
In "analyze for parallelism" is where you will see the value of understanding your code in terms of code patterns. Next, once you know the hotspots — the loops — and understand them in terms of parallelism, and you say, "okay, this loop can be parallelized," then you need to decide how to implement that parallelism. That's what we mean here by adding directives using OpenMP or OpenACC: these are implementations of the parallelism
that you have discovered in the second step. So: implementation of parallelism with directives. In this third step, Parallelware Trainer will help you to produce many implementations, using OpenMP and OpenACC, of your single code; it will help mainly in these two stages. When you produce a parallel code, you then need to compile it, run it, and measure performance. Did the performance increase? Yes — is it enough for me, for my problem? Then stop, and go do something different.
If not, then you need to optimize your code, which typically means improving data locality and minimizing data transfers on the GPU, and then start profiling again, to see if the hotspot that you found before keeps on being the most computationally intensive part of your code. This is an iterative process that you need to repeat.
When you go through all of this, essentially what you will find is that you have your code, you have your hotspot, you have identified parallelism, and you have implemented a parallel version that runs faster than the original code, so you will be accelerating this part of the code. But in order to get peak performance — this 100x speedup acceleration — you also need to parallelize the remaining sequential regions.
That will be covered in the next talk, which relates these peak speedups to the speedups achievable in real applications — essentially the effect of parallelizing loops. So let's go to the demonstration.
In the demonstration you will see OpenMP and OpenACC pragmas. What you will see is C code with some extensions, and these extensions have the form of a preprocessor pragma. After the special symbol `#pragma` comes what is called a sentinel; the sentinel identifies the family of pragmas that you are using. OpenMP uses the sentinel `omp`; OpenACC uses the sentinel `acc`.
After the sentinel, you have the name of the directive. We will use `parallel`, we will use `for`, we will use `critical`, we will use `atomic`, we will use `data` — different pragmas that by default have a meaning, a behavior, that is specified in the standard, but that you can modify using several clauses that change the default behavior of each of the directives. This is essentially what we will see.
In C and C++, by default it is the syntax of a pragma; in Fortran, it is the syntax of a special comment, with a dollar symbol before the sentinel (`!$omp`, `!$acc`). But the rest is essentially the same: a directive with clauses to modify the default behavior.
Finally, before getting started with the demonstration: OpenMP and OpenACC compilers. We have several of them. Probably the most mature OpenACC compiler on the market is PGI; we have a recent version on Cori, 19.4 I think. We also have a Cray machine with the Cray compiler, which also supports OpenMP and OpenACC. And GCC and Clang, the free, open-source compilers, also have support for OpenMP — very mature support — and they are pushing forward support for OpenACC, so in the most recent versions of the GCC compiler you can also compile OpenACC pragmas as well as OpenMP pragmas.
[Adjusting the projector — it is losing some part of the screen.] Okay, so this is what you will see when you open the tool. On the left-hand side you will see a project manager, as you have in many source code editors, so that you can manage different projects at the same time. Here you have the option to select a project — the one that you will be using in the afternoon — or you can, by clicking on File, Open Project, open a new project.
For instance, I have several projects here for the demonstration; let's open the PI example — click on it, and then you have the PI example here. One thing that is important to note is that a project is essentially a directory in your file system, nothing else than that. Well, we also store some hidden information, which you will see during today, so that you can recover your work and take it away with you, with all the work that you have done during the practicals.
This is important because many times you will have real codes which have a build system, a compilation system, scripts to run the code — and we don't want to interfere with that part. You just open the directory. If you are using a version control system such as Git, a directory is what is under version control; for us, a directory and all its contents is what a project is for Parallelware Trainer.
Whenever you open a project, the tool scans the directory and provides you with the contents of the project. In this case, if you double-click on the example called PI, you will see the code that we are using, and you will see these special green circles. What this means is that, in real time, the Parallelware technology has analyzed your code and has found the loops. It has checked that some of the loops cannot be analyzed for some reason — it can report on that — but it will show a green circle
A
If your loop is a candidate for parallelization — the tool has checked that it fulfills a minimum set of properties — the green circle tells you that this is a loop where you can begin to reason in terms of parallelism and introduce parallelism. When you click on one of these green circles,
A
you open this dialogue. In this dialogue we will be using, in the morning, these three panels here. As you can see, you can choose between OpenMP or OpenACC, CPU or GPU, multithreading or offloading. Let's begin with a simple example: OpenMP CPU multithreading. What I want to generate is a multithreaded version of the pi code to run on a CPU that has multiple cores. So once you select that, you click on this Parallelize button here, and here it is: the tool has analyzed
A
the code, has discovered the parallelism and has added the pragmas for you. These pragmas are correct and accelerate your code. How has the tool done this? Let me scroll up. Remember that we have an approach based on patterns: the tool discovers the type of pattern that you can find here. In the lower part of the UI you have three consoles: one for building, compiling your code; one for the execution — once it is compiled and you run it, the output goes to that console;
A
and finally the Parallelware console, where Parallelware reports the messages of the analysis that has been done. So here it is saying that at line 27 — the original line 27 — it found a scalar reduction pattern, where you have a variable that is processed using a commutative and associative operator. This is everything you need to know to determine that this loop can be executed safely in parallel.
A
OK, so in the first line in the Parallelware console, the tool provides you with the pattern that it has been able to discover. After the break we will see the full family, the set of patterns that we have available in the tool, and you will learn to recognize them. After that, you can see the available policies and strategies for the variable sum. What this means is that, once the pattern in the code has been identified,
A
the tool supports different ways of implementing parallel versions, using OpenMP or OpenACC and different programming paradigms. So you will be able to select which of these implementations you want to generate with the tool, and this is done automatically by the tool — you just have to give the appropriate instructions. Here, by default, it has selected strategy number one, scalar reduction; we will see that after the break.
A
What this means, essentially, is that the tool has taken the loop that you want to execute in parallel and enclosed it in a pragma that defines the parallel region. The pragma omp parallel says: here begins the parallel region. Until this moment you only have one thread; at this point, different threads are created, and all the threads collaborate until the end of the parallel region, where all of them are destroyed and only one continues.
A
OK, this is what parallel means. What it also means is that, in order to parallelize a loop, you need to divide the workload, the number of iterations, between the different threads. If you have ten iterations and two threads and each thread executes the whole set of ten iterations, you are really running a different program, with twenty iterations. That is not what you want, so you need to divide the ten iterations among the set of threads that you have created. How do you do that in OpenMP? With another pragma, omp for. This is
A
what is called work sharing: how to divide the iterations of the loop among the threads. And finally, you can see the reduction clause: it is saying that the variable sum is implemented using the scalar reduction policy. We will see in detail in the afternoon, after the break, how all of these parallelization strategies behave, but what I want you to see right now is that you will be able to choose between different policies and strategies. Finally, the console provides you some information about the generation of the code, how the code has been implemented.
A
If you don't know exactly what a scalar reduction means, the trainer also comes with a knowledge base that will keep improving and growing with the different versions that we release. So if you look at this message, you'll see this underlined text, and if you click on it, you will be presented with a glossary of terms that explains what a scalar reduction is. If you want to learn more, you can click on some of the glossary terms or on Learn
A
More, which will provide you with a more complete description, with examples in C and Fortran of what a scalar reduction looks like in those languages. So some part of the important knowledge about the patterns that you need to learn for parallel programming is also available within the tool; you don't need to go anywhere else to find it.
A
So let's close that. OK, we have generated the version; let's compile it. How do we compile it? If you look at these buttons here, this is the Settings button, where you can specify the command that you will use to build your code. Let's use, for instance, GCC, activating the OpenMP support: something like gcc -fopenmp pi.c -lm. What this command means is that you will be using GCC. You could use PGI or Clang — you are not tied to any compiler; it's just whatever set of compilers you have available on the system.
A
Cori has many compilers available. -fopenmp is the flag that you use to activate support for OpenMP pragmas. It means that the compiler, GCC, will take these pragmas and generate parallel code to implement the semantics of the parallel program, of all the pragmas that you have specified. If you don't enable this flag, these pragmas will be ignored and you will have the sequential code, as simple as that. Then, these are regular options: the name of the file and the name of the executable.
A
Yeah, I can do it later. I have a makefile there in the project, but I prefer to use this to introduce the command and the flags, because for many of you it is the first time that you are using OpenMP. But there is no restriction here: you put the command that you would execute in this path in the terminal. If you go to the terminal and execute the same command, it is exactly what the tool is doing behind the scenes. OK.
A
So if we specify this command and now click on this hammer button, Build Project — here it is, the code has been compiled successfully and now we have the executable generated. Now we want to run it. How do we run it? We go again to the Settings button, we select the Run tab, and we put the run command there.
A
OK, so I click on OK; now I click on this play button to start the execution, and in a different console it shows the standard output of the execution of the command. Everything you see here is the standard output that you would see from a terminal execution. OK, so this has run sequentially. So how can we run this in parallel, using several threads?
A
OMP_NUM_THREADS equals one. What does this mean? OpenMP and OpenACC provide you with pragmas (directives), with functions and with environment variables. With the environment variables you can control several things; one of the things you can control is the number of threads that you will be using in the parallel region. So if we first specify one, it will mean that I have a parallel region with only one thread — so sequential execution. I can run it like this.
A
Hopefully this replies to your question. Yes, those are the two mechanisms that we provide to control environment variables: you can add them here, or you can prepend them to the execution command. Of course, if you invoke a makefile, the makefile itself can set up all the environment variables it needs to use.
A
No, we are not providing a command-line interface at this moment. We have a different tool that we are designing, called Parallelware Analyzer, that will provide you a command-line interface to this kind of capabilities. It is designed and intended for batch processing, for compilation outside of the UI. It's something that is work in progress.
A
OK, in this current version, Parallelware Trainer 1.2, we can discover opportunities for parallelization in one file at a time. What that means is that, if you open a file and all the functions that you use in the for loop are defined within the same file, we can analyze it and we can discover parallelism, no problem with that. What is different is if you call a function that is defined in another, second file — second_file.c.
A
That is something that is work in progress, that we are about to finish, and it's a feature that is expected to come in Parallelware Trainer 1.3. That is a feature that is needed for big codes, because you usually call functions in one file that are defined in different files, so you need to somehow analyze several files altogether. That's something that is work in progress; we are about to finish it, and it will probably come in Parallelware Trainer version 1.3. At this moment, it's one file at a time; we are working on that. Yes.
A
From the point of view of the analysis of parallelism, main is just another function. What we do is analyze the code and try to find the functions that are called from a given function, to try to discover opportunities for parallelization. So, from that point of view, main is just another one. Main is important when you build the code into an executable, but not for us to do the analysis.
A
OK, one more thing I want to show you. Imagine that you are working on your project and you have made this change, and now you don't want this implementation. What do you do? You need to recover the original version somehow. In order to facilitate that workflow, we have this little arrow here. You cannot see it, but it is there whether you notice it or not; when you click on it, it opens this panel over here.
A
What this means is that, whenever you click on the green circle, choose the options, click on Parallelize, and the code changes, this is changing your actual pi.c file. But before doing that, we save a backup copy of the code that you had before inserting the pragmas. So this appears here: this is a kind of built-in versioning system, so that you can maintain the different parallel versions that are of interest for you and for your project. So, for instance, I can take this one — you cannot see it there.
A
So if I click, it will ask me: are you sure you want to do this? Because the tool is going to replace the contents of pi.c with the original version. So you need to know what the versions that you have here are — somehow save versions for your milestones as you make progress, so that if you do something that you don't want, you can restore one of these versions to begin again from a checkpoint.
A
OK — what did I just do? Let me cancel; we were here. OK, so let's click on the versions, and now you have different files here. Every single parallelization change that you generate creates a backup copy. So now, if you want to restore one of these original copies to discard those changes, you can just click on this button here: restore this version to the editor.
A
You have to confirm that the file will be overwritten with those contents. I say OK, and now I have again the same version that I had. These versions you can delete: you click here and you confirm the deletion of the versions. Let me, just for the sake of clarity, delete all these versions from the tests we did yesterday.
A
OK — even the original one I can delete. So the suggested workflow for the practicals is as follows: click on the green circle, generate the OpenMP CPU multithreaded version, Parallelize. It will generate the original-version backup. So now you can click on it; if you want to rename it, that's up to you. But what you can do, by clicking on this button,
A
is save a version explicitly, not automatically. So you say: OK, I want to create a new version that is pi using OpenMP, using the reduction clause, and now I have my version here. At this moment, what I can do is restore the original version, click on the buttons again and say: OK, I want an OpenACC GPU offloading version. I click on Parallelize, and here is your equivalent OpenACC implementation. So you can see some similarities between OpenMP and OpenACC up to this level.
A
Parallel/parallel: the same semantics — where the region begins and ends. For/loop: more or less the same semantics for sharing the iterations among the threads. Reduction/reduction: the scalar reduction pattern that has been found needs to be computed as a reduction, in OpenMP and in OpenACC. And then you have some additional clauses that we will see later. OK, but you can compare them, you can learn from that, and with this you can generate all the versions that you want.
A
For instance, I can also generate a version using the ACC reduction strategy. One more thing we can show at this moment. OK, you have been doing your practicals, you have been doing all of this work, and the question is: you are working here, you have access to Cori and to Parallelware Trainer; the workshop finishes and you go back to your office. How do you take away all of your work? Do you have to lose all of this? Do you have to copy and paste all of this? You don't have to. So let's see how things are stored in the file system.
A
There is a hidden directory named .pwt, and under it the tool keeps track of all the copies, all the named versions, of each file that you open and work with. This is the sequential code; this is code you can compile and run. So if you compress this whole directory, with this hidden folder, you have everything that you have done during the practical. You just have to compress it to take it away with you. This is exactly how version control systems work.
A
They create these hidden .git directories, .svn directories, so we do it in the same way — first so as not to interfere with other tools, but also to make it very portable. You can just compress it and take it away to another file system; if you have the tool there, you open it and it will recognize all this hidden information. OK, what you lose, if you don't have Parallelware Trainer, is all the usability features that you have in the graphical user interface, but you still have all the versions.
A
From the graphical user interface we don't have that capability. That's something that we discussed internally, because these are the kind of features that you have in professional IDEs, professional code editors: you can manage all the folders that you have in the file system from the graphical user interface of the code editor. But the problem here is that every single user that we talk to usually uses a different editor; they have their preferred editor, and they don't want to change their editor for development.
A
So we have tried to minimize the number of features that we have to manage projects in the trainer, because one user prefers Eclipse, another one Qt Creator, another one vi — every single developer has a preference for a different development environment. So here we decided to minimize the number of project-management features, because the user will be doing that in their preferred professional environment. That's the feedback we have got.
A
Creating a new project is just creating a new directory in the file system — again, from outside, not within the GUI — for the same reason that I tried to explain before. In professional development environments you have all the capabilities in the project manager to manage the file system: create directories, move files around, delete them. But we decided not to implement that in this case, because the graphical user interface is not at all intended to replace your preferred development environment.
A
What did you mean by remote? That you open the trainer locally, work locally, and at some point you want to do a remote launch on Cori? In order to do that, you just have to set up the appropriate execution command to transfer what you need through SSH — to copy what you need to Cori and launch the process. From there it's up to you how to do it, because that is independent of the organization of your project. Sorry.
A
At this moment — this is something we discussed yesterday — best practices for parallel programming recommend that, instead of using parallel for within target, you use teams distribute parallel for, to give the compiler the freedom to generate better-quality code for the GPU. This is something that will come in 1.3, probably, but it is something that we will definitely do. At this moment you can do it by editing here: this is a complete editor, so you can come to the editor and type teams.
A
We will see after the break why it's important to specify teams distribute parallel for on the GPU, and how to do that in OpenACC, which has the equivalent gang, worker and vector notation and terminology. We will see that later. OK, so yes, this is a complete editor, as you have in any other tool. You can modify and save versions and keep on working with them, and all the versions you save will be stored in the file system, so you can take away everything that you have done.
A
The last demo: what happens if I want to parallelize this loop, in OpenMP or in OpenACC, using a different strategy? Let's say I want to use atomic protection. I click on Parallelize, and now compare these two versions. They are both correct implementations of exactly the same original sequential code. The difference is how we implement the scalar reduction operation; we have several strategies.
A
The default strategy, whenever it is available in the standard, uses the reduction clause. But there may be situations where you have a reduction operation that is not supported by the standard, or where, instead of a scalar variable, you need to use arrays — and arrays are not supported for reduction operations in OpenMP and OpenACC in general, with some exceptions. So in that case you can still generate parallel code; the rest of the implementation is the same.
A
It's only changing the way this reduction on sum is handled: in one case through the reduction clause, and in the other case by guaranteeing mutual exclusion when each thread is computing and accessing the shared variable. OK, so what I want to show you here is this part of the panel.
A
I will restore again. Using this part of the dialogue, you can control which of these implementation strategies you want to generate, and the trainer will generate the different parallel versions that you can later compile, run, measure the performance of, and select the one that is fastest on your system. OK, so we will explore this in detail in the practicals after lunch; we will go into the details of all of this. OK, so now, yes, I think it's time to stop and take a coffee break. Thank you so much for being so interactive.