National Energy Research Scientific Computing Center (NERSC) Codee Analyzer Training, April 2022, 12 May 2022

From YouTube: 05 Codee identification of defects in parallel code

A

Okay, thank you. Thank you, everyone, for attending day two of this first set of training sessions in the Codee series at NERSC and OLCF.

A

Yesterday we had a first introduction to Codee, and we saw how to follow a simple, predefined sequence of steps using the Codee command-line tools to finally port your codes to the GPU. As examples we used a very simple one, the PI code, and another simple but slightly more complicated one, MATMUL, a matrix multiplication code. Today we plan to do the following in the three hours we have ahead.

A

We again plan a 30-minute break at around 7:30 our time, so in about one hour to one hour and 15 minutes. In the first part, we will see how you can use Codee to detect defects in your OpenMP or OpenACC codes. What this means is that, once you have taken your sequential code, decided which parts of the code you want to offload, and added pragmas to it, Codee can understand the code, understand your pragmas, and check whether the pragmas are correct for the code that you are offloading.

A

This is what we call detecting defects in your GPU-enabled code.

A

So, before the break, we will first see very briefly which defects are currently supported, and we will present an exercise, which we can probably cover in the next 30 to 35 minutes, for you to use Codee to detect one defect in GPU code. Right after that, we will see another short slide deck to understand the added complexity of real codes: why things are similar to, but significantly different from, working with isolated kernels like PI or MATMUL.

A

What are the challenges behind real codes, or the additional things we need to consider? For that, we will very briefly review, or enumerate, some of the difficulties you face with real codes, and we will present the LULESHmk example, which is a simplification of the LULESH CORAL benchmark but still contains functions from LULESH that are real, literally the same code as in the LULESH benchmark. We will also present how it is ported.

A

I will show you, through slides only, how you can port that code to the GPU using the same sequence of steps.

A

Also, before the break we plan to do a demonstration of Codee using Fortran codes. This Fortran support is still experimental: we finished development internally one month ago, and during this term, until June, we plan to test it internally or with early adopters who want to try the early releases of the Fortran support.

A

So today you will see the PI example and the matrix-matrix multiplication example written in Fortran, the same examples as yesterday, and you will see the same workflow in Codee producing similar results for Fortran code, so that the codes can be ported to a GPU. Then we will take the break, and after the break we will stop talking and you will have the opportunity, during a free hands-on session, to continue doing the labs, to do the new labs using LULESHmk, or, as Helen suggested, to work on your own codes.

A

We encourage you to bring your own codes and try to follow the initial steps to get started with Codee, so that we can help you during this session. If something comes up, then of course we can always continue the conversation after the course through appointment sessions that NERSC is going to facilitate in the upcoming weeks or months. Okay, so this is the plan for today.

A

So, let's start with the next lab, the third lab that we propose. We saw a set of GPU challenges that we need to address, and today we will see how we can use Codee to identify defects in GPU code, in particular defects in data transfers: data transfers that are coded using OpenMP or OpenACC pragmas and that seem correct, but are actually incorrect for some reason.

A

So, from all of the capabilities of Codee that we presented yesterday, in the actions reported in the performance optimization report produced by Codee, we have opportunities, recommendations, defects, and remarks. Today, in this lab, we will be focusing on defects.

A

So if you look at the catalog that we have open, we encourage you to use it, review it, and learn from it, and of course please always feel free to reach out to us or to NERSC, so we can identify new actions or elements to add to the catalog. We are always learning and working collaboratively with the community on this. So, from the open catalog that you can see on the website, we will focus on the section of defects, where today we have 11 defects implemented in the software.

A

You can also navigate the defects in a different way. Remember the six stages of the performance optimization roadmap: three sequential stages (optimizing sequential instructions, simplifying the control flow, and optimizing the memory usage) and three stages related to parallelism (vectorization, multithreading, and offloading).

A

So today, in version 1.3.1, three of the 11 defects correspond to offloading to GPUs, particularly data transfers, and the remaining eight correspond to incorrect multithreaded code, essentially using OpenMP. From the challenges we saw yesterday: identifying opportunities, we know how to find those in the performance optimization report produced by Codee. We also saw the importance of array shaping, and the importance of selecting and implementing a data structure in our programs to represent matrices.

A

We also saw how this can impact the way we need to manage data transfers. So today we will see how these data transfers, due to the data structure that we have selected, can be incorrect although they look correct. This is related to the well-known problem of deep copy: copying complex structures that are built with pointers, and navigating the pointers to move all the data correctly from the CPU memory to the GPU memory.

A

This is what is typically called deep copy. As usual, and we want to stress this many times, it is up to us developers: we are responsible for making correct use of the programming language, of the compiler that implements and supports the specification of the language, and of course of the parallel programming API that we use, OpenMP, OpenACC, or any other one. We need to learn the rules, and we need to learn to use them properly.

A

So it's up to us to use it consistently so that the compiler can do the rest of the hard work. For this case, we are going to use part of the materials we introduced yesterday. If you remember, we have this multi-dimensional matrix, let's call it a 2D matrix. This is the logical, structured layout of the data in our minds, but it is not necessarily how the data is actually laid out in the physical memory of the computer.

A

So it is up to the programmer to decide which data structure to use to represent logical matrices. Depending on the data type and on how we declare the matrices in our C, C++, or Fortran program, we can have the data consecutive in memory. This is highly desirable for performance, because when all the data is consecutive in memory we can traverse the whole data set in a natural manner, and this enables efficient computations.

A

It also enables efficient message passing, or efficient communication of data, because data can be packed in one single hardware instruction. But when we need to use large or extra-large amounts of data, statically allocated memory is typically not enough; it has its limitations. So we need to use the heap, that is, memory allocated in the dynamic memory of the computer, and then we enter the world of pointers.

A

So here, when we have a double-pointer implementation of a logical 2D array, a logical matrix, in C/C++ we don't have a guarantee that the data for all the rows is stored consecutively in memory. This is important to stress, because when we want to transfer three rows in one single operation, we cannot do it: the nine data elements in this example are not consecutive in memory.

A

We need to do it by segments: first send row number one, next send row number two, next send row number three. If we try to send all nine elements starting at the position of the first element of the logical matrix, this will fail. This is what is remarked here. So this relates, if you remember, and now I'm jumping to the lab.