From YouTube: 07 Putting it all together in real codes
Okay, so hopefully at this stage we all have a clear understanding of the systematic and predictable way that Codee proposes to optimize the performance of our software, from very simple codes like PI, to something more complicated like MATMUL, even to detecting defects. And we know that when a given piece of our code fulfills the properties of a typical kernel that we want to optimize, then we can use everything we have seen so far.
But in order for real applications to get to this stage, where we can unlock these capabilities, we essentially need to address the complexity of real software. So here, in this "putting it all together" session, what we want is to enumerate, highlight and raise awareness around some of the topics that make real applications more complicated from the performance optimization perspective.
For that purpose we are proposing LULESHmk. This is a simplification of the LULESH CORAL benchmark that we already used in past editions of training sessions at NERSC, and now we have essentially reviewed all that material and how to approach the optimization of LULESHmk using this new predictable pathway that we have in Codee today.
Essentially, in LULESHmk we have selected functions and pieces of code that are exactly the same pieces of code you can find in the real LULESH CORAL benchmark, while others have been simplified for the sake of teaching: to enable focusing on specific challenges from the programming point of view, and to reduce a bit the complexity of real applications, while still being able to illustrate some of these challenges.
So here you can see in this table that in this course we have addressed how to find opportunities for offloading, which is not easy; many times we need the support of tooling to do that better and faster. We have addressed how to optimize the data layout for data transfers, and also how the data transfers, and the potential defects we can introduce in our code, are related to the way we represent the data in our codes: the logical interpretation of a matrix is different from its real implementation in our codes. But real codes have more than that.
Real applications typically have multi-dimensional arrays, which lead to deeply nested loops. Can we exploit more of the computational power of GPUs by parallelizing different nested loops all together, at the same time? This is what we typically address when we talk about the challenge of exploiting massive parallelism through loop nest collapsing. We will try to address that in the following course.
Another thing is: we have seen how to surround a given kernel with data transfers, but what happens if this kernel is not executed once, but inside a simulation loop, repeated many times? We need to understand the wider picture of the application to see how to minimize data transfers. Maybe some input data is read at the beginning of the program and only needs to be transferred once, not once per iteration of the simulation loop.
This is what we call here minimizing the transfers through the convergence loop. And sometimes we don't have one simple loop; we have several loop nests chained one after another. Data transfers should also be minimized there, by grouping several loop nests within one single data transfer. This is minimizing data transfers across consecutive loop nests. And for this session we have selected the last challenge: in real applications we typically don't have a loop nest that is self-contained.
Inside these loops we typically have function calls, which in turn can make more function calls, so we can have nested function calls in our applications; this is something that arises naturally in real applications. So this is the challenge that we have identified and called here "identify auxiliary functions to be offloaded". Okay, so.
Before showing you the additional materials that we will not be covering in this course, let me just point out that all these materials are available. Remember again: we are responsible for the correct usage of the APIs, OpenMP and OpenACC, and of the programming language. Of course we have the support of the compiler, and we have the support of additional tools like Codee, but in the end we developers must guarantee correctness and check that the performance increases. So in real applications, beyond what we said here, we have just enumerated some of the challenges.
Some challenges have to do with the platform, or with the complexity of real applications. You need to deal with not only C, not only Fortran, but a mixture of C, C++ and Fortran for some programs, with routines that need to interact with each other. You typically need to use different compilers: you want your application running on Summit, running on Perlmutter, or running on an AMD system like Crusher; different compilers, different runtimes, different programming environments.
We want to handle this complexity as well with build systems. Simple codes can be compiled with a single invocation of the compiler, but real applications composed of thousands of directories and source code files need build systems. We need a more portable and maintainable way of building our software by components, and this is where Makefiles or CMake tools come into play, so we need to find a way to interact with these build systems as well. And, of course, operating systems: it is different to deploy and develop on Linux.
We need some kind of methodology to benchmark correctly: take several measurements, discard the extreme, non-relevant values, take the average of the rest, and always guarantee correctness. So, in the end, all of these are topics that we typically don't cover in this kind of course, but in real applications you need to consider the constraints and the requirements of your project.
So, in the end, you can be optimizing performance, but with trade-offs: performance versus maintainability, performance versus portability across platforms, performance versus something else. Developing real applications with high quality is not easy, and sometimes we need to make trade-offs between one objective, like performance, and another one, like portability or readability, that can be contradictory, fighting one with the other. So this is the complexity of real applications and real software.
For interaction with compilers, you can go and read the performance optimization report of Codee and compare what it can add with respect to the existing compiler. You can be compiling your code with the NVIDIA compiler on Perlmutter, and maybe you want to run it with the IBM compiler on Summit or with the AMD compiler on Crusher, so you need integration with compilers. Also, use case number five: integration with build systems like CMake and Make. And finally, benchmarking the performance on the platform.
So for these use cases, and others that are not mentioned here, like detailed information about memory access patterns, about data scoping, about the memory footprint of your application, we have more advanced capabilities in Codee that we have not explored in this course and that are not listed here. But our purpose and our commitment is, for each of these use cases, to provide you with a video that you can watch, typically around three minutes long, where you can see how to use the Codee capabilities for that use case.
Essentially, I want to highlight the typical use cases and the videos that you have. If you have any request, any typical use case that you don't see covered here, please reach out to NERSC, or to us directly, so we can guide you on how to use it, and eventually even produce a new video recording that can help you and others in the community. Okay, please feel free to do it at any time; we will proactively be addressing those suggestions.