From YouTube: Current State of LLVM Compiler
Description
Johannes Doerfert (HPE)
Current State of LLVM Compiler
A: Thank you. I'm going to give a quick overview of LLVM OpenMP; mostly I will touch on support for CUDA and HIP, and I have bonus material at the end for some questions. I'm trying to focus on application developers here, and if you are interested in what I'm going to say, or you're like, you know, okay, I want to know more:
A: This is actually going to be a kind of short version of a three-hour tutorial I gave that really goes into detail about tips and tricks for application developers. What to do in which situations? What cool features are there, and so on and so forth? So I give you the cliff notes here, but that link, which is valid until the end of the year for i1 participants, will get you to the three-hour tutorial with lots of details.
A: So it might be a little too long, but it's certainly detailed. Okay, cool, let's start. If you are looking at LLVM, Clang is probably where you would start nowadays, and you want to build it yourself because you want, you know, the newest features, the newest bug fixes, or you want to play with it.
A: A single CMake command usually suffices to get LLVM from GitHub and be able to do GPU offloading with it. So this CMake command will probably get you a Clang compiler that can do CUDA offloading, OpenMP offloading, and so on and so forth. There are a lot of useful flags that are described on the web pages down there, and this is the first of a few slides that you might want to just screenshot instead of me going over all of it.
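[For reference, a minimal sketch of the kind of CMake invocation meant here; the slide's exact command is not captured in the transcript, so the flag choices below are assumptions based on the LLVM documentation:]

    # Build Clang/LLVM with the OpenMP runtimes and GPU backends enabled.
    git clone https://github.com/llvm/llvm-project.git
    cmake -G Ninja -B build llvm-project/llvm \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_ENABLE_PROJECTS="clang;lld" \
      -DLLVM_ENABLE_RUNTIMES="openmp" \
      -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU"
    ninja -C build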
A
Those
are
you
know
like
information
for
later.
In
case
you,
you
know
you
need
it,
you
can
go
back
to
it.
You
find
the
resources,
take
a
screenshot
if
you
like
it.
A
Similarly,
this
is
my
cheat
sheet
for
just
using
technically
any
c
plus
compiler.
It
always
show
you
a
fast
Linker
use
things
like
CC,
cache
and
and
an
alternative
to
make.
That
is
faster,
consider,
link
time,
optimization
and
we'll
get
to
that
later.
There's
a
lot
of
good
Tooling
in
the
C
plus
plus
World
C
world
that
you
should
consider
always
make
sure
you
get
the
right,
optimization
flags
and
the
right
architecture
set
and
then
on
the
llbm
specifics
site
here.
The
online
documentation
is
often
not
that
bad.
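[A hypothetical rendering of the cheat-sheet items just listed, not the slide's exact text:]

    clang++ -fuse-ld=lld ...        # use a fast linker (lld)
    clang++ -flto=thin ...          # consider link-time optimization
    clang++ -O3 -march=native ...   # right optimization flags and target architecture
    # compiler cache plus a faster alternative to make:
    cmake -G Ninja -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ...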
A: So there's a lot of documentation on a lot of topics there. We have a lot of sanitizers that are really useful, and if you build your own LLVM, a release-with-assertions build is probably sufficient for you, rather than a full debug build. Okay, this one is OpenMP-specific (LLVM OpenMP-specific), again a cheat sheet, so take a screenshot of this. We'll get into details on some of these as we go along, so I will not talk about them now; take a screenshot and we'll go on. All right, so, first things first.
A: If you have any questions about LLVM, about LLVM OpenMP, any other subtopic: go and ask the community. I mean, LLVM is mainly community-driven. There are a lot of companies behind it, but the companies have different interests and so on and so forth, and we all come together in this community, and we have various ways for you to get involved, but also to answer your questions. Go and ask your questions; it's free. There is a forum/mailing list, which is called Discourse; there is a persistent Discord chat.
A: There is an IRC chat. There are sync-up meetings on various subtopics, including alias analysis, MLIR, machine learning, OpenMP, and so on and so forth. There are even office hours nowadays, where you can talk to experts and ask them questions. So really make use of this; there's a lot of information out there.
A: But you can also ask your questions, and they will usually be answered. Now, looking at the latest release, which is LLVM 15, which was just released very recently (we're looking at the first month after release), there are a couple of new changes that are especially interesting for the OpenMP offloading folks, but also for the GPU, CUDA, and HIP support.
A: With this new driver we added a lot of support. One thing is we get multi-architecture binaries, so you can have a single library or executable that will run on all kinds of platforms: it will run on AMD and on NVIDIA platforms; it will run on, you know, an sm_30 and an sm_80 and a gfx908. So you can bundle all of them into one library, or a static library, or all of them into one executable, and it will just work.
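[A sketch of what such a multi-architecture build can look like; the architecture names are examples, not the slide's list:]

    # One executable carrying both NVIDIA and AMD GPU images:
    clang -fopenmp --offload-arch=sm_80 --offload-arch=gfx908 app.c -o app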
A
We
have
link
time
optimization
in
while
llvm
had
that
for
longest
time.
We
now
have
link
time
optimization
for
device
code.
So
if
you
go
to
the
GPU,
you
can
now
enable
lto
on
the
GPU
side
only
if
you
use
the
new
driver.
So
if
you
use
openmp
with
the
new
driver
or
Cuda
with
the
new
driver,
you
can
do
link
time
optimization
for
your
device
code
and
a
lot
of
times
that
actually
gener
like
gives
you
quite
a
performance
boost.
A
As
I
said,
static,
Library
support
is
now
in
there
all
ways
of
generating
a
static
library
that
has
device
code
in
it
and
then
using
it
should
work
if
there
is
a
way
that
doesn't
work.
Let
us
know
openmp
with
the
new
driver
becomes
interoperable
with
Cuda
and
hip,
which
we'll
see
later
a
little
bit,
and
we
have
a
lot
of
additional
Flags
in
that
release
to
improve
performance
in
in
situations
where
the
compiler
cannot
argue
what
is
correct
or
what
what
static?
What
transformation
is
correct?
A
You
can
help
the
compiler,
so
I
always
have
this
community
efforts
like
with
a
lot
of
names
from
a
lot
of
Institutions.
So
you
see
this
is
you
know
me
giving
a
talk,
but
the
work
is
done
by
a
lot
of
people
and
there's
probably
more
people
that
and
I
should
update
this
slide
to
just
to
give
you
an
idea
on
the
top
left
corner.
You
see
a
link
to
our
weekly
meeting.
We
meet
once
a
once
a
week
for
llvm
openmp
and
feel
free
to
join.
A
These
feel
free
to
join
any
meetings
and
they're
all
in
this
llvm
calendar.
Okay,
I
mentioned
this
before,
but
there's
various
ways
to
get
involved.
We
have
right
now
we
have
an
FAQ
for
the
lbm
openingp
webpage
on
telegram
openmp
webpage.
We
have
these
meetings
where
we're
like
25
people-ish
every
week.
It's
mostly
GPU
Focus,
but
you
get
also
the
people
from
all
the
vendor
companies
such
that
you
know.
A
We
have
office
hours,
we
have
the
discourse
openmp
category
and
we
have
a
slack
and
a
weekly
meeting
for
openmp
specific
optimization.
So
if
there's
something
you
know
you,
you
want
to
see
the
following:
optimization
done
or
the
following
feature:
prototyped
in
llvm.
That
might
be
a
good
place
to
to
find
people
to
help
you
to
do
so
and
now,
as
a
kind
of
case
study
of
people
that
actually
reached
out
and
worked
with
us
on
their
code.
A
So
they
came
to
us
and
said:
okay,
we
are
really
interested
in
using
llvm
openmp,
but
our
cook,
our
performance,
is
really
bad.
So
we
looked
at
their
code,
and
so
you
don't
have
to
read
all
of
this.
The
the
main
idea
here
is
on
the
right.
The
highlighted
numbers,
so
all
of
the
numbers
that
are
in
the
green
oval
are
speed
up
numbers
from
the
different
optimizations
that
we
performed
while
working
with
them.
We,
like
the
first
thing
we
did.
A
We
made
a
cmake
Unity,
build
which
effectively
is
copy
together
all
the
files
into
one
big
file
and
then
compile
it.
We
don't
need
to
do
that
anymore,
because
we
now
have
device
site,
LinkedIn,
optimization
device
set
lto
which
allows
us
to
not
do
this,
but
still
get
the
same
benefit.
But
it's
much
faster
and
then
we
help
them
with
determining
what
what
optimization
Flags
they
can
use
and
then
we
help
them
like.
A
We
worked
with
them
and
found
a
performance
bug
in
llvm,
and
then
we
helped
them
with
improving
their
code
and
at
the
end
of
the
day,
you
know
if
you,
if
you
multiply
all
these
factors
together
the
speed
up
that
they
got
just
from
you
know,
working
together
with
us
on
the
compiler
application.
Development
was
like
a
100x
or
something
like
that.
So
it's
it's
really.
You
know
it
adds
up
and
we're
usually
really
happy
to
work
with
app
teams
as
well
in
close
collaboration
to
make
their
code
faster
and
learn
in
the
process.
A
What
optimizations
were
missing,
where
we
are
like
having
insufficiencies
in
our
runtime
and
in
the
compilation
and
so
on
and
so
forth?
So
so
this
was
really
a
good
experience
and
if
people
are
interested,
you
really
should
you
know
reach
out
so
shout
out
to
John
and
the
openmc
team
who
did
and
I
hope
they
had
a
good
time
as
we
did.
A
Okay,
there's
a
lot
going
on
in
the
in
the
openmp
ecosystem,
especially
like
in
lovm.
So
this
is
just
you
know,
an
excerpt
of
all
the
scientific
papers
that
are
written
with
everything
from
tooling
to
optimizations.
A
Not
all
of
it
is,
you
know,
goes
back
into
llvm.
So
if
you
download
lvm,
you
don't
get
all
of
this,
but
a
lot
of
it
feel
free
to
take
a
picture
in
case
you
ever
like.
You
know
want
to
know
more
about
things,
especially
if
you're
you
know
in
a
scientific
world,
but
this
shows
you
I'm,
showing
this
mainly
to
convince
you
that
there
is
a
lot
of
you
know,
research
efforts
going
in.
A
So
if
you
have
fun
and
interesting
problems
in
your
code
or
you
want
to
you
know,
figure
out
or
you
have
an
idea
for
a
better
way
to
make
it
fast
or
offload
it.
There's
probably
people
out
there
interested
in
working
with
you.
A
Okay,
let's
look
at
some.
You
know
end
user,
end
user
improvements,
so
for
one
nowadays,
if
you
turn
on
minus
G
or
minus
G
line
tables
minus
the
line
tables,
only
you
get
information
about
where
a
crash
happens.
If
that
one
happens,
it
tells
you
to
visit
this
web
page
that
has
more
debugging
options
and
so
on
and
so
forth.
So
the
the
error
messages
that
come
out
of
openmp
offload
is
are
much
better
than
before.
A
At
the
same
time,
if
you
use,
if
you
go
to
this
webpage
and
you
used
it
the
what
is
explained
there
to
do
debugging
you
get
these
environment
variable
Flags,
so
lip
balm,
Target
info
is
is
highlighted.
Here
is
one
of
them
that
allows
you
to
see
what
the
compiler
did
with
your
code.
So
how
did
the
compiler
execute
your
kernels,
your
target
regions?
What
data
is
copied
and
when
and
why
and
how
did
the
compiler
map
you
know
implicit
arguments
and
so
on
and
so
forth?
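[A sketch of the usage described; LIBOMPTARGET_INFO takes a bit mask, and per the openmp.llvm.org documentation setting all bits enables every kind of message:]

    # Print kernel launches, data transfers, implicit mappings, etc. at runtime:
    LIBOMPTARGET_INFO=-1 ./app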
A
So
this
really
helps
you
to
understand
what
is
going
on
under
the
hood
and
you
can
get
various
kinds
of
information
at
a
like
fine
granularity.
At
the
same
time,
we
have
a
debug
mode,
which
is
which
you
have
to
enable
at
compiler
time
and
at
runtime
together,
and
once
you
enable
it
at
both
times.
You
get
these
debug
checks
in
baked
into
your
program
and
these
debug
checks.
A
If
you
run
out
of
Heap
memory
and
if
nalok
returns,
another
pointer
oftentimes
things
go
bad,
so
the
so
here
in
this
example,
it
did
and
one
way
of
solving
this
would
now
be
to
use
an
environment
variable
that
we
expose
to
increase
your
HEAP
memory,
and
then
this
error
message
would
disappear
and
things
will
probably
run.
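[A sketch of the debug-mode workflow described, assuming the names documented on openmp.llvm.org; in particular the heap-size variable name should be treated as an assumption and checked against your LLVM version:]

    # Compile-time half of the debug mode:
    clang -fopenmp --offload-arch=sm_80 -fopenmp-target-debug app.c -o app
    # Runtime half: enable the baked-in debug checks:
    LIBOMPTARGET_DEVICE_RTL_DEBUG=-1 ./app
    # Grow the device heap, e.g. to 1 GiB, if malloc on the device returns null:
    LIBOMPTARGET_HEAP_SIZE=$((1 << 30)) ./app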
A
Oh
yeah
I
mentioned
this
before
we
have
these
assumptions
to
help
the
compiler
to
do
better
optimizations
of
your
program
to
because
you
don't
really
know
what
the
compiler
needs.
We
also
have
remarks
so
optimization
remarks.
You
can
turn
them
on
with
the
commands
in
the
upper
left
corner
here.
So
our
past
basically
turns
on
just
generic
remarks.
Our
plasma
turns
on
things
about
missed,
optimizations
and
Analysis
about
analysis
results
that
we
collected
throughout
and
then
it
these
remarks
also
tell
you
what
to
do
about
them.
Like
is
this
a
good
thing?
A
Is
this
a
bad
thing
on?
How
do
you
kind
of
interact
and
act
upon
this?
Now
and
usually
you
have
like
three
different
ways
you
can
write.
Prakma
OMP
assumes
you
can
write,
attributes
assume
or
you
can
use
command
line
Flags,
and
we
have.
We
have
standard
assumptions
in
the
openmp
specification.
We
have
llvm
specific
assumptions
that
we
that
we
recognize
and
we
have
by
now
at
least.
A
You
know
five
I
think
by
now
more
than
that
command
line
flags
that
that
provide
assumptions
to
the
compiler
one
more
thing
is
we
even
have
you
know,
environment
variables
that
you
can
use
to
improve
performance
if
you
do
not
need
certain
guarantees
in
all
of
these
are
kind
of
explained
on
our
lovm.openmp
sorry
openmp.lvm.org
webpage,
where
we
explain
all
the
environment
variables
and
so
on
and
all
the
remarks
and
so
on.
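[A sketch of the remark and assumption flags just mentioned; the openmp-opt pass name follows the remarks pages on openmp.llvm.org, and flag availability depends on your Clang version:]

    # Turn on generic, missed, and analysis remarks for the OpenMP optimizer:
    clang -fopenmp --offload-arch=sm_80 -O2 \
          -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt app.c
    # Example of an assumption provided via a command-line flag:
    clang -fopenmp -fopenmp-assume-no-thread-state app.c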
So let's look at an example. This is, you know, classic vectorization on the left.
A: Oh, we have these globalized variables, and they are globalized because they are potentially captured, and then it tells you you can use this noescape attribute to override this, and it gives you this code at the end. If you look at the lowest row, at the end it says OMP113, and if you go to the web page, all of these diagnostic numbers have their own page.
A: There's also more information about how to record these remarks, and how to filter them based on hotness in your code, so you see only remarks that are, you know, relevant to performance; and this also helps with regard to how to show remarks inside your source code rather than on the command line, and so on and so forth. So there's a lot of tooling there that you can utilize. Now, I mentioned this briefly before: we have multi-architecture binaries.
A
So
when
you
now
nowadays,
if
you
download
a
client,
you
can
say
offload
architecture,
and
then
you
put
the
sub
architecture
there
and
openmp
is
going
to
do
all
the
rest
for
you.
So
if
you
look
at
the
example
here
in
the
in
the
the
second
command
clang,
you
don't
have
the
First,
Command
and
the
second
command
are
equivalent.
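[A sketch of the equivalence meant here, following the Clang OpenMP documentation; the slide's exact commands are not in the transcript:]

    # Long form:
    clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
          -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_80 app.c
    # Equivalent short form:
    clang -fopenmp --offload-arch=sm_80 app.c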
A
We
can
you
can
use
lrvm
object,
Dam
to
analyze
a
an
object
file
or
to
determine
what
kind
of
images
are
in
there.
In
this
case
you
see
we
have
the
gfx
90A
and
the
sm80
image
in
there
and
now,
if
you
link
against
this,
you
can
either
link
against
both
or
one
of
them
or
none
of
them.
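[A sketch; the --offloading option of llvm-objdump is an assumption to check against your LLVM version:]

    # List the offload images embedded in an object file:
    llvm-objdump --offloading app.o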
A
I
mentioned
the
link
time.
Optimization
really,
really.
This
helps
you
to
get
better
performance,
even
if
you
only
have
a
single
translation
unit-
and
it
usually
isn't
going
to
make
your
code
that
much
slower
to
compile,
because
we
only
do
it
for
the
device
code,
so
minus
F,
offload,
lto,
minus
F,
offload
minus
lto
turns
it
on,
and
you
have
to
run
it
both
at
the
compile
and
at
the
Linker.
So
this
is
kind
of
a
little
odd
for
now,
but
this
is
right
now.
A: Here are some results on what happens if you turn on LTO. You see, on the left, these benchmarks (XSBench, su3, and I think even miniMDock) are single translation units, so they don't benefit from the multi-translation-unit optimization, while on the right, OpenMC and Thermo4PFM benefit from multi-translation-unit LTO; it really gets you better performance, because it optimizes across all of your files.
A
So
when
you
offload
your
gpus
static,
libraries
are
now
fully
supported,
so
no
matter
how
you
build
your
static
libraries,
it
should
just
work.
You
can
also
you
know
again
analyze
them
and
see.
What
is
what
off-road
images
are
in
there
and
then,
if
you,
you
can
use
this
to
build
static
libraries
that
only
contain
device
code,
for
example.
A
So
and
then
you
can
basically
ship
static
libraries
that
are
only
for
the
device
and
if
you
use
lto
so
the
link
time
optimization,
there
is
no
overhead
in
you
know:
bundling
your
device
code
in
static
libraries
and
then
merging
it
with
your
kernels
later
on.
If
you
do
not
use
lto,
you
would
have
these
call
edges
and
other
overheads
that
that
you
might
not
want,
and
as
you
see
here
when
you
do
the
offload
art
command,
you
can
have
a
lot
of
sub
architecture.
A
So
you
can
compile
this
for
a
whole
bunch
of
architectures
and
then
say
Okay
embed.
All
of
these
sub-architectures
in
this
in
this
object
file,
make
it
a
library
and
later
on
I,
just
Link
in
that
library,
for
all
the
support.
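[A sketch of such a device-code static library; file and architecture names are hypothetical:]

    # Embed several sub-architectures in one object, archive it, link it in later:
    clang -fopenmp --offload-arch=sm_70 --offload-arch=sm_80 --offload-arch=gfx908 \
          -c devlib.c -o devlib.o
    llvm-ar rcs libdev.a devlib.o
    clang -fopenmp --offload-arch=sm_80 app.c libdev.a -o app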
A
Now,
with
the
new
driver,
so
here,
if
you
see
the
minus
minus
offload
new
driver
option
and
the
fgpu
RDC
option,
when
you
turn
them
on,
you
can
actually
link
together,
openmp
and
Cuda
code
or
open
a
p
and
hip
code
so
which
is
which
is
great,
because
now
you
know
you
can
mix
and
match
your
program.
You
can.
You
can
use
Cuda
libraries
that
you
find
on
internet
or
Cuda
codes,
and
you
can
call
them
from
your
openmp
code
technically
also
vice
versa.
A: If you're only interested in device code, you can use this flag to look at the device code: --offload-device-only gives you the LLVM IR for the device code only. It's really useful for debugging.
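[A sketch of that usage:]

    # Emit only the device-side LLVM IR, e.g. for inspection while debugging:
    clang -fopenmp --offload-arch=sm_80 --offload-device-only -S -emit-llvm app.c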
A
However,
there
is
also
the
minus
F
openmp
offload
mandatory
flag,
which
effectively
disables
the
host
fallback
in
openmp
offload
and
basically
says:
okay
I
only
create
Target
regions
for
the
device,
which
is
sometimes
really
useful,
especially
if
you
have
you
know,
Cuda
functions
or
so
on
that
only
exist
on
the
device,
and
you
want
to
call
them
and
you're
not
interested
in
ever
running
your
target
regions
on
the
host.
So
this
is
so
you
don't
have
to
create.
A
You
know:
host
Alternatives
or
host
fallback
codes
for
device
only
code-
oh
yeah,
we'll
we'll
skip
this
now.
A
couple
of
small
tools
that
are
that
are
interesting
to
people.
So, when you build LLVM, or if it's installed, there's a binary called llvm-omp-device-info which, if you run it, will tell you what LLVM knows about the devices on your system.
A
So
the
first
four
devices
are
going
to
be
CPU
devices
and
then
no
matter
how
many
CPUs
you
have
that's
just
an
artifact
for
now
and
then
afterwards,
it
will
show
all
the
gpus
in
what
we
know
about
the
gpus.
So
you
know
about
like
what
kind
of
compute
capabilities
you
have
What
GPU.
It
is
memory,
size
and
so
on
and
so
forth.
So
this
gives
you
an
idea
if
lovm
actually
is
able
to
see
your
gpus
and
also
gives
you
an
idea
about
what
the
GPU
properties
are.
A
You
can
turn
on
target
profiling
with
llbm
since
12
by
just
setting
an
environment.
Variable
lip
On,
Target
underscore
profile
and
what
you
get
out
of.
There
is
a
Json
file
that
you
can
load
into
Chrome
tracing
or
a
lot
of
other
tools
that
support
the
Chrome
tracing
format,
and
that
gives
you
you
know
a
trace,
a
very
simple
trace
of
what
is
happening
and
with
llbm
16.
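[A sketch of the profiling workflow described; LIBOMPTARGET_PROFILE names the output file:]

    # Write a Chrome-tracing JSON profile of the offload runtime's activity:
    LIBOMPTARGET_PROFILE=profile.json ./app
    # Then load profile.json in chrome://tracing or any compatible viewer.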
A
So
the
next
release
we're
actually
going
to
give
you
the
capability
to
either
automatically
or
manually
profile
parts
of
your
kernel
code
so
of
your
device
code
and
embed
that
into
this
profile
as
well.
So
this
was
not
going
to
be
as
good
as
you
know,
these
big
profiling
libraries,
but
it
gives
you
something
that
is,
that
comes
with
llvm
by
default
and
if
you
just
want
to
do
some
local
profiling
or,
like
you
know,
some
quick
checks.
This
is
really
useful.
A
Okay,
because
I'm
running
out
of
time,
quick
summary
with
llvm
openmp,
you
can
offload
to
remote
gpus
if
you
want,
if
you
want
to
so
this
is
upstream
and
available
and
currently
we're
looking
at
adding
an
MPI
backend
as
well.
So
if
you
want
to
program
only
openmp
and
you
want
to,
but
you
program,
multiple
gpus
or
technically,
even
one
GPU-
you
can
utilize
Hardware,
that
is
in
the
cloud
or
on
on
distributed
systems
purely
with
openmp.
So
on
the
user
Level
side
there
is
no
difference
whatsoever.
A
It
also
scales
reasonably
well.
So
this
is
an
example
of
XS
bench
scaling
to
120
gpus
with
moderate
overheads.
A: You can use all the host tooling, but it's the original CUDA code, and it kind of runs in the same, you know, GPU fashion. Similarly, you can take CUDA code and compile it with Clang onto other hardware: not only the host, but you can also take that CUDA code and run it on AMD hardware. And it's not only about CUDA code.
A
But
what
we're
showing
here
is
that
openmp
gives
us
this
target
independent
runtime,
and
we
can,
if
we
use
it,
we
can
merge
and
interoperate
Cuda,
openmp,
hip,
sickle
and
so
on
and
so
forth.
So
that
you
can
mix
and
match
in
your
application,
whatever
you
like
best
and
behind
the
scenes,
everything
will
work
together
and
be
portable
as
well,
which
is
great
and
the
results
here.
Long
story
short
we
generally
perform
just
as
well
as
the
native
programming
language
in
the
native
compiler.
A
Okay.
The
last
thing
I
briefly
mentioned,
is
what
we
are
presenting
at
the
SC
workshop
for
the
lbm
HPC
Workshop
in
a
few
weeks.
The
classic
idea
is
your
program
runs
on
the
CPU
and
you
orchestrate
your
data
and
computation
movement
in
kernels
onto
the
GPU,
which
is
what
we're
doing
for
years
right.
A
As
part
of
this
work,
we
literally
take
entire
programs,
no
modification,
and
we
run
them
on
the
GPU
and
while
the
the
paper
which
you
see
at
llvmhpc
is
going
to
show
really
bad
performance
numbers,
but
it
shows
proof
of
concept.
Our
news
performance
numbers
look
pretty
good
so,
and
this
effectively
gets
rid
of
all
the
porting
whatsoever.
We
just
run
the
entire
thing
on
the
GPU,
which
also
works
in
an
environment
where
you
do
not
have
unified
sheet
memory
by
the
way
which
is
kind
of
fun.
A
Okay,
I'll
leave
it
with
that,
because
I
want
to
have
some
some
time
for
questions
and
the
slides
will
be
available
where
you
can
see
the
the
recap
and
Outlook.
Thank
you.
B: Thank you, Johannes, thanks a lot for agreeing to give the talk. We have one question, and... I'm sorry, I've lost my chat. Right.
A: ... You can just get a subset of the host OpenMP, but people are working on this. And how about the benefits? We see benefits if you do these, you know, fancy things, like take CUDA code and translate it into host code, where you do complex transformations in order to make it run fast; but we haven't actually explored anything beyond that. So there aren't any benefits in, you know, just simple same-device, same-target optimizations.
A
On
the
last
topic
on
running
the
program
on
a
GPU
and
going
to
the
host
only
when
needed,
what
is
the
state
of
the
art
doing
I
O
directly
from
the
GPU
meaning
without
having
to
use
explicit
device
telescopics
and
using
device
storage,
libraries,
okay,
cool
as
far
as
I
can
as
far
as
I
know,
the
state
of
the
art
is
there.
There
is
no
production
working
solution
or
even
research
solution
that
I'm
aware
of
that.
A: But I'm actually trying to get someone to work on it, because some NVIDIA folks have shown, you know, a proof of concept that you can do I/O from the GPU, and I would really like to have an open library to do that, and then provide, effectively, things like fopen, fread, and so on, on the device, through direct communication with the kernel rather than RPC syscalls and so on.
A
This
is
also
part
of
our
effort
that
we're
currently
really
engaged
in
is
getting
lib
C
and
lip
C,
plus
plus
to
run
on
the
GPU
or
at
least
in
large
parts.