National Energy Research Scientific Computing Center (NERSC) Codee Analyzer Training, April 2022, 12 May 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 03 Codee parallelization capabilities

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Okay, let's continue so I hope you had the first sense of how to use kodi for several use. Cases in particular for once, cody succeeds with its technology to identify opportunities in your code, in particular, for offloading to gpus how you can follow the instructions and the suggestions to actually used code error, writing capabilities to annotate your source code with openmp and open ecc.

A

So for those of you that completed the lab- and that's a really great for those of you that are still in the middle of of the lab once I finish this presentation, I will present the second lab, which is very similar, but it's using the matrix matrix multiplication code, and then we will have more time for for you to continue with the two laps okay.

A

So um let's continue so what we would like to see now in this in this deck is we want to identify uh which are the of some of the key challenges that as a programmer, you will need to address in order to create code that runs on on a gpu? Okay in particular.

A

Some of the challenges that are really typically pointed out is how to make an efficient usage of the memory for data transfers, in particular, and for execution on the gpu how to exploit massive parallelism. That means, if you can create thousands of threads on the gpu, don't create only hundreds of them and minimize data transfers, because this is one of the typical bottlenecks that you can find.

A

So here in particular, what we will be seeing is something related to memory usage. That is how our design or our decision to represent our data in a particular data structure, how this can influence or have an impact, positive or negative in the data transfers of your or in the complexity of coding.

A

A solution to use take advantage of the gpu, particularly using openmp and open ecc, and we will see you this uh using the matrix method. Multiplication example on permuter.

A

So very briefly, uh we need just to remember very briefly: what are the differences from the point of view of programming of a programmer between a cpu or a gpu? In the end, both of them are executing threads somehow in the hardware, but typically the numbers of threads to be executed on the cpu is relatively low.

A

Typically, 816 is a reasonable number, while in the gpu you might expect to have many, many more typically hundreds of thousands of threads executing at the same time.

A

A second difference is that how the threads are coordinated in the hardware level on the cpu. Typically, we have control of the threads, and these threads are executed, all of them at the same time, and we decide how the communicators synchronize with each other on the gpus. Typically, these rules are not so straightforward. They are more complicated and, for instance, you have seen that threads on the gpu are grouped in what is typically called, for instance, gpu warps that executed all of them at the same time, using the same instruction instruction counter.

A

So this has an impact on the way that we group or execute the threads on the hardware level. Also in the cpu, we have a memory system that is relatively simple, although being complicated enough.

A

That is several levels consisting of the registers of the processor, the cache the memory subsystem and up in the in the memory hierarchy, while in the gpu we have different kinds of memories, from cache to texture, to a scratch path to a global memory, and not all the threads have access to the whole memory, depending on the thread executing at a given time.

A

The amount of what is the part of the memory system that it has access to is restricted, and this is this influences that we program for it, and also one of the things that is common in both architectures is that in the end, the exec they exploit some kind of vector mode on the cpu. This is typically exploited through ex through scene instructions or vector instructions, while in the gpu more or less the equivalent is having all of the threads within the.

A

For instance, a thread work to execute the same instruction at the same time on different data points, so somehow cmd a vectorization is also important to take advantage of performance on the gpu. So significantly, there are significant differences. Even this high level description of the of the architecture you we can have a sense of how different they are from the point of view of programming for these different types of hardware.

A

So one thing apart from the archival architecture, differences, there is another thing that we need to consider that has a big impact on performance that is in the end. uh Gpus are different, separate memory than the cpus, for instance. This is. We typically are told that when we launch a program on the gpu, there is a thread that starts execution on the cpu side, the cpu side, somehow at some point of loads computation to the gpu.

A

This means that the gpu needs to be aware of what instructions are going to be executed on gpu, but also all the data that is needed to execute those computations need also to be transferred from the host memory that is accessible by the host on the cpu side.

A

That information is to be transferred or copied to the gpu so that the gpu threads can have access to that data on the device memory and when the result is computed, the results need to be transferred back to the from the device memory to the host memory, for instance, for the single thread to continue and print out the solution to the to a particular problem.

A

So, in the end, when we're talking about maximizing performance, what we see is there are several rules that we need to consider basic rules. In the end, we need to transfer data from the cpu memory to the device and keep it there so that the computations of loaded to the gpu can do work and have enough work to do, and they can reuse as much data as possible and make an efficient usage of the memory bandwidth of the tpu and in the end, transfer all the results back to the cpu.

A

So this gpu host driven executable model. If we look at it from the point of view of challenges in the end, we still need to identify what parts of the code need to be offloaded to the gpu. So finding opportunities for offloading is one of the key challenges that we still need to address.

A

Another thing we need to to consider is how to optimize the memory layout so that, in the end, data transfers between the cpu and gpu are executed efficiently and also when the code is executed on the tpu. It is also executed efficiently by making an efficient usage of the memory a subsystem available in the in the gpu and also in order to other challenges that we will not be addressing today and tomorrow. But that could be addressed in the future.

A

Is how can we spawn more and more threads to take advantage of the thousands of spreads that are typically available on a gpu or how we can minimize data transfers across loop nests or minimize data transfers through simulation iterate iterative loops that we can identify, so that the data can be reused on the gpu as much as possible? So.

A

For the purpose of this of this course, we are going to focus on these three challenges. The challenge number one that we have already seen with by example, is how can we identify in the code loops that are candidates to be executed on the gpu?

A

This is what we call opportunities for performance improvement, in particular for of loading to the gpu okay, and it is important that we always guarantee correctness both on the cpu and on the gpu, because in the end, in the hardware level, we will have several threads, typically executing concurrently the same different iterations of a given loop, so identifying these opportunities for loading is one of the challenges, so this will be present, always in every single loop that we want to flow to the gpu.

A

The second thing that we will be the challenge that we will be addressing in the second lab of of today is optimizing, the memory layout for data transfers and in particular, we will be dealing with what is called array shaping or shaping of arrays. That is how can we understand if one my 1d 2d 3d 5d array- that is logically a multi-dimensional array, how the data is actually laid out in memory, because, depending on how I actually code or use the data structure, to actually implement my matrix, my multi-dimensional metrics?

A

This will have an important impact on performance and even on correctness. So here we are going to see in the challenge number two: how selecting the data type the data structure actually impacts on the shaping of arrays that we need to manage with openmp or openhcc, and the third challenge that is about identifying defects, data transfers.

A

We will be seeing tomorrow, as also as depending on how we use what data structure do we select. We maybe have a need to address the problem of deep copy. That is whenever the data is, is not consecutive in memory, but is spreading memory and we need to follow pointers to actually access to a given data point of my matrix or my data structure.

A

That has a complexity that needs to be managed both on the computational side, but also on the data transfer site, and it means typically, this is not managed automatically. This needs to be managed explicitly in our programs, so these are essentially the three main challenges we will be working with today and tomorrow.

A

So in order to understand why tools are needed, remember that compilers are here to help and in particular, compilers can implement application program interfaces like openmp or openacc and application program interfaces are designed for us as a program as programmers to specify what the compiler needs to do and how it needs to do it. So we need to specify the what and the how it is our responsibility to as programmers to carefully uh indicate what needs to be done at each moment in time.

A

We cannot expect a compiler implementing, openmp or pcc, for instance, to automatically detect opportunities in the code to be offloaded to the gpu. This is not something that you can in general, expect from a compiler implementing, openmp or openhcc. Also, the compiler will not guarantee correctness. It is you, when you add a pragma.

A

You are telling the compiler what to do in order to manage an offloading opportunity and how to do it, how to actually code the data transfer. So it is your responsibility as a programmer to guarantee that you are using the application program interface interface properly, so the compiler can do the job of the hard job of translating high level practice that are good for the programmer into lower level code that can execute efficiently at the hardware level. Okay. So, in the end, we as programmers we need.

A

We are responsible for making a good usage of the application program interfaces so having additional tools apart from compilers. That can help us to make a good usage of the application program interfaces.

A

It's always something that can save a lot of time and can help us to produce a fast and correct code for a particular cargo architecture like a multicore cpu or a gpu.

A

So having that in memory, let's focus now on the problem of array shaping so essentially a reshaping is a way that we can specify to the compiler how the data structure looks like in particular how an array looks like which are the size, 1d, 2d, 3d, 5d arrays, and which are the ranges which are the amount of data that can be stored in each of them in each of these dimensions.

A

All providing this information provides very useful information for the compiler to do many, compile time, optimizations and in particular, also to do correct memory, allocation and even memory data transfers from the cpu to the gpu, by taking a look at the array shaping information for the gpu of loading, a problem. So how does do we specify? Typically, an array shape the shape of an array?

A

Typically, we provide the name of the array x, I a b c and we typically provide within brackets the first element that is relevant for the dimension and the last element that is relevant for the dimension. Typically, for instance, if we want to specify that we have a one-dimensional array x of n elements.

A

That start in zero, for instance, we need to specify if the element is called x x and within brackets 0 columns n indicating that it has an elements in the fourth dimension, and we can use the same notation to specify the shape of the array in multi in multiple dimensions. Okay- and you can see that both openmp and openhcc provide a syntax to specify array shapes.

A

A

If we are working in our code with 1d arrays, typically, these are called vectors. We can have a vector of five elements, a vector of ten elements.

A

So typically, when we allocate a vector a one d, a array in c c, plus plus of fortran, the programming language specification, typically guarantees that all the data points, all the elements are stored consecutively in memory.

A

So all the elements, the compiler guarantees that the array is allocated consecutive memory and that all the elements of the array are stored in consecutive locations so and this event is guaranteed by using a statically located memory like declaring the second declaration, a point, an array of nine elements of type double or by using malloc, for instance, to allocate this in dynamic memory, both of them guarantee that all the data is allocated in consecutive memory.

A

So this is very good information for the compiler because it can trigger and enable compiler optimization that otherwise could not be possible. So, let's look at what happens in when we are trying working, not with 1d matrices of one dimension, typically called vectors, but we are working with matrices of two three. Four arrays or dimensions: okay, here um from a logical perspective, uh we have a in this case a three by three matrix, where we have different elements in different rows of the matrix.

A

But what happens from the point of view of the programming language c, c, plus, plus and fourteen- have different rules to lay out the data in memory for multi-dimensional arrays, and this is what is typically known as cc plus plus implementing a raw major storage. That means that consecutive elements consecutive rows in that two two-dimensional matrix are stored in consecutive memory locations, while fortran, for instance, uses the other the opposite approach. It is column major, so consecutive elements in columns are storing consecutive elements in memory, so, in the end, the developer.

A

We as developers need to be aware of how the programming language, which are the programming language rules, to lay out the data of our multi-dimensional arrays, and this is not only for the for the programming language. Also, the application program, interfaces of openmpr or an openscc have rules to lead to deal with a multi-dimensional array and with data layout, okay. So, let's take a look at, for instance, different declarations of this array.

A

If we have a logical matrix of three by three elements and we declare it in c as a double array of three by three. Essentially, this is using statically allocated memory, both in c and fortran. So the language guarantees that all the nine elements will be stored and allocated consecutive memory in our programs, and this is good because all the data that needs to be transferred to the gpu, for instance, are in consecutive memory locations.

A

We can do that in one single transaction in in memory, so we just need to consider that cc, plus, plus and fortran have different rules. The data within the consecutive memory region changes the position of the same data and once in c consecutive rows in consecutive number of positions in fortran consecutive rows, columns and consecutive memory positions.

A

But what happens if we change the data layout or the way we declared our matrices? What happens if we use malloc pointers? Typically, what we see is that we have this double asterisk asterisk representation.

A

What this actually means is that the nine elements of our logical array are stored in consecutive positions, but for its role in in this in this case. So what we have is the first column row in consecutive memory, the second row in consecutive memory, but between rows. We don't have guarantee that all the data will be stored in consecutive memory. So we need an auxiliary array to actually keep track of the pointers that point to the beginning of its row. So this is essentially what we do when we use dynamically allocated to the arrays.

A

So in the end we are responsible for allocating the memory we are responsible for setting the correct values in the pointers that point to the beginning of the memory positions. So when dealing with openmp openscc with data transfers, we are also responsible for dealing correctly with the pointers that point to the elements in the in the malloc arrays.

A

So when we, when you look at real codes, you in between the statically located arrays that we have here and the double pointers, for instance the fully dynamically located arrays. What we observe is that we tend to use intermediate code representations that somehow has half the advantages or and the pros and the cons of statically allocated arrays and dynamic allocated arrays, let's consider, for instance, this alternative way of representing the data.

A

So we cannot use dynamic memory to allocate nine elements of the data consecutive in memory, so we have the benefits of statically located arrays that have all the data elements consecutive in memory, but with this auxiliary arrays of pointers now pointing to different offsets within this unique array of nine elements.

A

What we have is we can preserve the the ease of use of multi-dimensional array notations. You can see in this loop that we can use the notation a and between brackets fourth dimension, I access to the eighth row and in the second dimension I access to the jth element within the eighth row. So I don't have to change the way.

A

I am coding the access to the elements in the array, but this makes the um data layouting memory a bit more complicated, as it is again my responsibility as a programmer to use it properly in the allocation.

A

The allocation and during the computation stage, so when you look at real applications typically, we see also this implementation that uses dynamic memory, because in dynamic memory we don't have limit or major limits in the amount of data that we can store in one single array, and what we typically see is that all the data of the matrix is allocated in consecutive memory locations and we have one single pointer to this complete array with all the data points, what happens?

A

In contrast in content, what happens is that we need to adapt the computation, the loops instructions of our code to actually explicitly use this new data structure representation. So here, instead of having access to the position a a row, I column j. What we need is to actually explicitly code the element that we will be accessing from the beginning of the first, the first element of this array. This is what we see.

A

Typically, as a of I multiplied by the number of elements in a row plus the coordinate of the column that we want to access to. Okay. So essentially, we can see that we as programmers we can. We have different alternative ways of designing our data structure to start to actually store all the data points. Until until this moment we have seen a dense representations of tense matrices.

A

What this means is that, in the data, we see all the elements of the array stored of the matrix stored in memory independently of the value if they are zero, we store 0 values if they are different from 0. We store this different from 0 values, but what happens in also in many real applications is that the data structure needs to be adapted to huge large matrices, with a big number of zero elements.

A

So here many times in different scientific domains, we use sparse matrix representations that, in the end, they are making the axis more complicated, because we have all the data stored in consecutive memory positions as we had in dense arrays. But now we are not allocating memory for zero values, because we know that those elements that are not allocated are essentially implicit, er, zero values, but this makes more complicated to know exactly, which is the position.

A

The coordinate of this a data element within the original logical array, and for this we need to use auxiliary, arrays to actually reconstruct or repeat or determine the row and the column of the element in the in the new, sparse storage format.

A

Here, in this example, we are showing a very widely used, sparse storage format that is called the crs format that essentially uses one array of data in this in this case is name a that stores all the elements.

A

One call index matrix that stores the position, the column position within the row of the matrices, and it uses an auxiliary array that indicates they start the position where each of the rows starts. So look at the right. We are just traversing all the elements of the array, but look at the complexity of how this needs to be coded again by choosing a data representation, either dense or sparse, to represent in our cc plus fortran programs. Our logical matrices.

A

We need to adapt the code to actually use that representation. So when it comes to coding for gpus, the key question here is: okay. How does this affects in practice to open npr, open sec practice? In particular? How does this affect, for instance, to data transfers? We know that we need to code the application being aware of the data structure that we are using, but what happens with the data structure?

A

So here you can see that we will use an example in the lab that is, we will use a representation of a 2d matrix using the double pointer representation. This essentially this representation. We see here in the slide 12 and what we see here on the right is different ways that you need, in which you need to manage the double pointed data structure in openmp and openscc in openhcc. You can see that you have in the copying copy out and copy out clauses.

A

You have the name of the array, and you also have this notation for array shapes, so you are specifying how many dimensions the array has in this case, these arrays have two dimensions: a b and c and you're also specifying the first position and the number of elements in each dimension. So you are telling that array array a is consists of m multiplied by p columns elements in the array.

A

So, from this perspective, openscc provides a notation that is quite easy to use and easy to understand, but look at look, for instance, at the openmp implementation in openmp implementation. The recommendation is to explicitly transfer its range of a consecutive elements explicitly using a enter data, a clause or directive in the code.

A

So here you can see that we need to transfer the array of pointers of the double pointer and then we need to transfer all the rows independently so that on the gpu side, the pointers on the matrix of pointers are correctly set by openmp. Okay. So somehow we want to show here in this lab is that whenever we choose the data structure to represent multi-dimensional arrays, this has an impact on the way we code, our algorithms in the in the program, but also it has an impact on how we need to use.

A

For instance, data transfers of openmp or open ecc and the each program interface provides a different alternative, best practice recommendation on how to code these data transfers. So this is going to be the the objective of the second next lab.