From YouTube: Programming platforms: Kokkos, RAJA and OpenACC

Description
Rahul Gayatri (LBNL), Sunita Chandrasekaran (University of Delaware) and David Alexander Beckingsale (LLNL) present a panel discussion on programming platforms: Kokkos, RAJA and OpenACC. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Panel Chair: Dossay Oryspayev
Dossay Oryspayev (moderator): We'll have two main parts. The first part will be presentations: short presentations of about five to seven minutes by each of our panelists. Then we'll have slightly more than 30 minutes for a discussion and Q&A session, and I encourage all of the audience to actively ask your questions and participate in the discussion. You can do this either using the Q&A box, or you can raise your hand and I will unmute your mic so that you can ask your question directly to one of our panelists, or to all of them.

Okay, that being said, I would like to first introduce Rahul Kumar Gayatri. He is an application performance specialist in the APG group at NERSC, where he works mainly on helping application teams optimize their codes for next-generation architectures. He was a postdoc in the same group prior to joining the staff, and did his graduate work at the Barcelona Supercomputing Center, where he worked with the OmpSs programming model group. Okay, Rahul, the floor is all yours.
Rahul Gayatri: Thanks, Dossay. Hello, everyone. Can everybody see my screen?

Dossay Oryspayev: Yes, Rahul.

Rahul Gayatri: Okay, thank you. Hi. So, as requested earlier, today I'll be talking about OpenMP, that is, OpenMP for GPUs in the context of this workshop, and the Kokkos programming model.
Let me start with a brief introduction to what I want to talk about. OpenMP: I'm assuming that most of us are pretty familiar with OpenMP as a framework. If you want to know more about OpenMP for GPUs, I would encourage you to check out Chris Daley's talk from yesterday's performance, productivity and portability panel.
I think he gave an excellent summary of the available OpenMP offload features, the compilers that currently support them and what their status is, and their performance on the different benchmarks and micro-benchmarks that he tested.
There are the NVIDIA GPUs that will be there whenever Perlmutter is ready for production; then there are the Intel GPUs that will be available on Aurora, for which the Intel compilers plan to support the OpenMP offload features; and the AMD GPUs on Frontier, for which both the Cray and AMD compilers plan to support these features.
So, in a sense, we can think of OpenMP as this portable framework: you can take an existing code written with the offload directives and run it across all the next-generation supercomputers that will be in the DOE space.
I know the title says performance portability, and I don't want to get into that discussion, because, as we saw in yesterday's panel, there are multiple viewpoints and there is no single accepted definition of what performance portability is as yet. But at least you can be sure that there will be some sort of portability when you use these OpenMP offload directives across these multiple architectures. How performant they will be, and how close to the peak performance of each architecture they will get, that's yet to be seen.
But the good thing about this, in my experience working with these directives and with different compilers, is that the compiler developers are really receptive to any bugs you might hit or any feature requests you may have.
They have open forums where you can submit these bugs, and most of the time it is actually something the programmer is doing wrong rather than the compiler. But when there is a genuine compiler bug, they are very diligent in fixing it and releasing the fix with the next compiler release.
So in that sense it's a very active community, in terms of users and in terms of compiler developers.

As for the advantages and disadvantages of using OpenMP offload directives: the first thing, as we all know, is that OpenMP is relatively easy to use. It's not too invasive in your code; you can just annotate a block of code, saying that the subsequent loop has to run in parallel, and when these annotated directives have the target keyword in them, OpenMP will offload them to whatever target accelerator is available. It also has support for the C, C++ and Fortran languages, which are quite widely used in our community. And there is active work on having some sort of implementation of these offload directives on the CPUs, in case there is no accelerator available where the code is being run. So there is portability; at least, that's the plan going forward.
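(For reference, and not from the talk itself: a minimal sketch of the kind of annotated loop being described, offloaded with OpenMP target directives. The function and array names here are illustrative.)

    // OpenMP offload sketch: the "target" keyword sends the annotated
    // loop to whatever accelerator is available.
    void saxpy(int n, float a, const float* x, float* y) {
      // map(...) describes the data movement; "teams distribute parallel
      // for" exposes the loop's parallelism to the device.
      #pragma omp target teams distribute parallel for \
          map(to: x[0:n]) map(tofrom: y[0:n])
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }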
The biggest drawback that I have seen, as Chris also pointed out yesterday, is that you have to really simplify the code when passing it to the OpenMP compilers to get optimal performance. This is especially true in C++, where even with some sort of advanced templated metaprogramming, which is supported by C++11 (and the current spec says that C++11 is the supported base language), you still don't see results as performant as you would if you had really hand-tuned each of the template parameters. And it requires most of the best GPU programming practices, like doing column-major access for coalesced memory access, to be done by hand, rather than depending on the framework to get this for you.
In that sense, Kokkos is a much more advanced framework: it allows you to have fine-grained control over the code, while also doing a lot of these performance tricks in the backend by itself. So, for those of you who do not know what Kokkos is: it is a C++-based programming model for writing performance-portable applications.
It allows you to expose abstract hierarchies of parallelism in your code, and then it is the job of the framework to map these hierarchies onto whatever target architecture you are running on. It provides some sort of portability across all major HPC platforms, and the way it does this is by supporting different backends. The backends that are already available are the serial backend, which is basically sequential code; the Pthreads backend; the OpenMP 3.0 backend; the CUDA backend; and the CUDA UVM backend.
The fact that there are already OpenMP 3 and Pthreads backends implies that Kokkos code will already run on most of the CPUs available for HPC, and the CUDA and CUDA UVM backends imply that you can run Kokkos code on all the NVIDIA GPUs that are available. And apart from these backends, there's active development of an OpenMP target backend.
What are some of its highlights? As I mentioned earlier, it allows you to abstract away the execution and memory spaces: it will let you allocate a particular piece of storage in a particular memory space, and execute a given block of code in a particular execution space.
The way it abstracts the memory space is with these View classes, which are basically an abstraction for multi-dimensional arrays. That lets you choose your memory layout, that is, what type of data layout you want, whether you want row-major or column-major storage, and on which memory space you want that default storage to happen.
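(For reference: a minimal sketch of the View abstraction being described; the layout and memory space are template parameters, and the names and extents here are illustrative.)

    // A Kokkos View is a multi-dimensional array whose layout and
    // memory space are template parameters, so the same code can use
    // row-major host arrays or column-major device arrays.
    #include <Kokkos_Core.hpp>

    // 100x50 matrix, column-major (LayoutLeft), allocated in the
    // default memory space of the default execution space.
    Kokkos::View<double**, Kokkos::LayoutLeft> A("A", 100, 50);

    // Same shape, but row-major (LayoutRight) and explicitly placed
    // in host memory.
    Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::HostSpace>
        A_host("A_host", 100, 50);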
How much time do I have, one minute? Oh, okay, okay. And then it lets you do 1D or multi-dimensional loop computations, where it lets you tile the loops, and it divides the available parallelism into three different hierarchies, which you can think of as analogous to CUDA: teams map to thread blocks, threads to threadIdx.y, and vector to threadIdx.x. And it allows you to do atomic operations.
This is a simple example of how a simple C++ code can be written in Kokkos. On the left side you can see the C++ code, where we have an integer array of ten elements being updated inside a for loop. On the Kokkos side, you can actually choose your execution space, or, you know, use the Kokkos default execution space.
This code will then run without any change with the OpenMP 3 backend on CPUs, or the CUDA backend on GPUs, and the HIP backend whenever it comes, and the SYCL backend whenever we start working on it. So you don't need to do anything else for portability.
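(For reference: a minimal sketch of the kind of side-by-side shown on the slide, a ten-element update written as a plain C++ loop and as a Kokkos parallel_for; the variable names are illustrative.)

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        // The plain C++ version would be:
        //   int a[10];
        //   for (int i = 0; i < 10; ++i) a[i] = 2 * i;

        // Kokkos version: the same loop, dispatched to the default
        // execution space (OpenMP, CUDA, HIP, ...) chosen at build time.
        Kokkos::View<int*> a("a", 10);
        Kokkos::parallel_for("update", 10, KOKKOS_LAMBDA(const int i) {
          a(i) = 2 * i;
        });
      }
      Kokkos::finalize();
    }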
Dossay Oryspayev: Thank you, Rahul. Our next talk will be given by Sunita Chandrasekaran. She is an assistant professor in the Department of Computer and Information Sciences at the University of Delaware. She received her PhD in 2012, on tools and algorithms for high-level algorithm mapping to FPGAs, from the School of Computer Science and Engineering at NTU Singapore. Her research spans high performance computing, parallel programming, benchmarking and data science; applications of interest include scientific domains such as plasma physics, biophysics, solar physics and bioinformatics. Sunita?
Sunita Chandrasekaran: Thank you, Dossay. I hope you all can see the slides. Okay, thank you, everybody, for joining, and for, you know, being able to do this online.
So, following up on Rahul's previous talk on OpenMP and Kokkos, this is about OpenACC. The idea, again: it's a directive-based programming model, one of the two directive-based programming models, the other one being OpenMP. I'll just be skimming through some of the ongoing things that we have been up to with OpenACC.
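(For reference: a minimal sketch of what an OpenACC-annotated loop looks like, the same directive-based approach as OpenMP offload; the loop itself is illustrative, not from the slides.)

    // OpenACC sketch: one directive asks the compiler to offload the
    // loop; copyin/copyout describe the data movement.
    void scale(int n, float a, const float* x, float* y) {
      #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i];
    }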
This was the OpenACC 3.0 specification, which was announced at SC last year, and there is a link on the slide that will take you to more elaborate updates on the new features added in 3.0. We have started to work closely with the base languages to be able to support some of their important features, and there is still an ongoing conversation within the OpenACC technical committee about updating the base languages to C18, C++17 and Fortran 2018, with the motivation of defining behavior for C++ lambdas, which were added in C++14.
There are definitely many more things to do with respect to supporting other features in these languages, so this is pretty much a start. We also improved multi-device support, through direct memory copies and synchronization, as another added feature. Prior to this, if you wanted to copy data from one GPU to another, you had to copy to the CPU and then to the other GPU. By enabling this improved multi-device support, we won't need to synchronize back and forth with the CPU, and we won't need to block the CPU twice.
Similarly: a zero modifier on the create data clause, an expanded list of directives that can support the if clause, and some more clarifications and cleanup based on user feedback as we go through the specification. As all of you might have experienced, you always want to update and clean up based on the feedback you have received while developing the features as well as using them.
I also wanted to step through some of the activities with respect to using OpenACC in scientific applications. Some of these data are not super current, but, for example, 18 of the INCITE applications at Summit use OpenACC (that data is from November 2019), and the top-five HPC applications figure from Intersect360 Research is a couple of years old now, but Gaussian, VASP and ANSYS Fluent are some of those top three-to-five HPC applications.
As for platforms supported, there is a list there, but I would also plus-one Rahul's previous comment: we could get into the performance portability aspect and talk about it for several hours, and we would never come to a consensus. So those are the targets that OpenACC currently supports. And for OpenACC applications, as you can see, the trend has grown from about 30 applications all the way through to above 200, and the applications worked on at hackathons obviously also count toward the increase in the number of different types of domain-science applications that OpenACC is able to support.
We are also running an OpenACC Slack channel, which has grown quite a bit over the past couple of years, especially among those participating in GPU hackathons. We invite them to the Slack channel, which they basically use as a Stack Overflow, if you like. We have been debating between keeping Slack and moving to Stack Overflow, but the bottom line is trying to answer questions as and when the users have them, with an easy mode of communication to get them up to speed.
Ever since this was released and made available, you can see the total number of downloads has steadily increased, and I myself use it as part of my teaching, in parallel computing or computer architecture courses, just like you would use, say, GCC OpenMP for parallel computing classes.
This is an ongoing effort. A bit more on the GPU hackathons: there are a couple of varieties that any of you could participate in, boot camps and hackathons. One is a couple-of-days event, the other one is a five-day event. You basically bring your code and bring your team, there are mentors assigned to your team (two mentors per team), and you just sit down, hack code, and get it working on the systems that the particular hackathon host supports.
This has been very successful, and you can see from the plots how the numbers of participants and of codes have been steadily increasing. I myself did one of them at UD, back in 2016; we had six teams, and the codes definitely took off. It was a nice way of exposing domain scientists to the different ways you can program, you know, accelerate and parallelize their codes on large-scale systems. And thanks to Julia, we got this slide:
I got it from her literally just yesterday, showing the bunch of hackathons coming up. So take a look at gpuhackathons.org and you will find a range of applications, and it's not just OpenACC: it's CUDA, OpenACC, OpenMP, Kokkos. There are all kinds of applications, sometimes even Python.
So if you're interested, you have a code, and you're looking for help and mentorship to move your code to large-scale systems, you have more than one place to go to participate, and several of these hackathons are running virtually right now; I hear it is pretty successful. So do make use of that for your GPU porting. The last thing I wanted to draw your attention to is the OpenACC teaching kit, for people like me
who teach, and I'm sure there are several of you on the call who probably educate and teach, you know, the next-generation workforce. There's a bunch of teaching material that I worked with NVIDIA to put together, and there is Dockerization available as well, Google Slides, and some codes for lab exercises. These are some of the modules; there's room for improvement, but there's something for you to start off with and get your hands dirty.
Dossay Oryspayev: Thank you, Sunita, and thank you for staying within the time. Okay, our next programming-model talk will be given by David Alexander Beckingsale. He is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His work focuses on programming abstractions; he is the project lead for Umpire and CHAI and a core RAJA team member. David received his PhD in computer science from the University of Warwick, UK, in June 2015.
David Beckingsale: So if you're familiar with RAJA you'll know this, but it's a library of C++ abstractions that allows you to write single-source, portable loop kernels. The key idea here, and it's similar to what Kokkos does, is to really insulate the application source code from any hardware or programming-model details, and it gives you an extra layer of abstraction on top of some of the other kind of portable programming models, like OpenMP. RAJA supports a wide range of application needs.
In terms of backends, we've made really good progress, pretty much supporting all the current platforms, as well as being well underway to supporting the machines that are coming up soon, which you'll be expected to run on. So we have regular sequential loops, some SIMD stuff, and OpenMP support both for the CPU and for target offload with OpenMP 4.5. We also have a partial Threading Building Blocks backend.
The initial goal was to get these big old codes running efficiently on Sierra, and at the time this porting effort was started, a complete CUDA rewrite was never going to fly, because it's not portable, and when you have a million lines of code, you just can't afford to maintain multiple versions. And when these efforts were started, OpenMP offload wasn't really viable. I think this was mentioned before, but you're really heavily dependent on compiler support if you want to use something like OpenMP, and so that wasn't a route our application customers really wanted to go down. And then, if you look out at the future platforms coming down the pipeline, they are based on GPUs from different vendors, so having some abstraction that insulates you from this changing technology is critical.
So, in terms of what RAJA looks like: at the top here we've got just your standard C loop, and we introduce a few concepts that allow you to write it in a portable way. The first thing is the execution template; here we have forall, which is our simple loop API. That's templated on an execution policy, which determines where this code is going to be executed. Then, instead of passing in your loop bounds as just a beginning and an end, we provide these iteration-space objects that allow you to describe what you're going to be iterating over; in this case, that's the RangeSegment, which is just a contiguous range of indices. And the final piece is that, instead of writing your loop body just in there,
you turn it into a lambda expression. The numbers here in the bottom right are just comparing the speedups of this loop: you can write it natively in sequential, OpenMP or CUDA, or use the RAJA version, where the only thing that changes is the template parameter. And you can see there, taking the time of the native implementation over the time of the RAJA one, we're pretty close to parity across those backends.
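(For reference: a minimal sketch of the forall pattern being described; the policy shown is one of several, and the loop body and names are illustrative.)

    #include "RAJA/RAJA.hpp"

    // RAJA::forall: execution policy as a template parameter, a
    // RangeSegment as the iteration space, and the body as a lambda.
    void scale(int n, double a, const double* x, double* y) {
      RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n),
        [=](int i) {
          y[i] = a * x[i];
        });
      // Changing RAJA::seq_exec to RAJA::omp_parallel_for_exec retargets
      // the loop to OpenMP; a device policy such as RAJA::cuda_exec<256>
      // additionally needs a RAJA_DEVICE-annotated lambda.
    }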
The next API is the kernel API, and this is how we describe the more complex cases, multiple levels of loops, as well as non-tightly-nested ones. It's parameterized in kind of the same way:
you have the kernel function, templated on an execution policy, but instead of a single iteration space and a single lambda function, you can pass in an arbitrary number of iteration spaces and an arbitrary number of lambda functions, and then the execution policy describes how those are iterated over and the order in which you want things done. So with this we support things like loop tiling.
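(For reference: a minimal sketch of the kernel API for a simple nested loop; the policy here is a sequentially-nested one, and the function and array names are illustrative.)

    #include "RAJA/RAJA.hpp"

    // RAJA::kernel: several iteration spaces, a policy describing how
    // the loop nest is traversed, and one or more lambdas.
    void add2d(int ni, int nj, const double* A, const double* B, double* C) {
      using Pol = RAJA::KernelPolicy<
        RAJA::statement::For<1, RAJA::seq_exec,      // outer loop over j
          RAJA::statement::For<0, RAJA::seq_exec,    // inner loop over i
            RAJA::statement::Lambda<0>>>>;

      RAJA::kernel<Pol>(
        RAJA::make_tuple(RAJA::RangeSegment(0, ni),
                         RAJA::RangeSegment(0, nj)),
        [=](int i, int j) {
          C[j * ni + i] = A[j * ni + i] + B[j * ni + i];
        });
    }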
In the previous code examples, the policies just determine where that loop is going to run, and data accessibility is the responsibility of the programmer. Again, this goes back to the initial work with the code teams at Livermore, where they already had code to manage their data in the way they wanted to, and we weren't going to come in and tell them that they had to move everything to some special RAJA-managed data type.
The two projects that I work on in this space are CHAI, which provides an array-like object that coordinates with RAJA so that the data moves back and forth between the CPU and GPU implicitly, depending on where your RAJA loop is going to run; and then we also have the Umpire project, which is a portable API for accessing different types of memory resources.
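(For reference: a minimal sketch of the CHAI idea, a ManagedArray that migrates based on where the RAJA loop executes. This assumes a RAJA build with the CHAI plugin enabled, and the names are illustrative.)

    #include "RAJA/RAJA.hpp"
    #include "chai/ManagedArray.hpp"

    // CHAI's ManagedArray copies itself to the right memory space when
    // it is captured by a RAJA loop, so no explicit copies are needed.
    void demo(int n) {
      chai::ManagedArray<double> v(n);

      // Host execution: v is touched in CPU memory.
      RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n),
        [=](int i) { v[i] = 1.0; });

      // Device execution: v is copied to GPU memory automatically.
      RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, n),
        [=] RAJA_DEVICE (int i) { v[i] *= 2.0; });

      v.free();
    }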
So really, what RAJA has given us, in having an abstraction on top of these various other programming-model technologies, is the ability to write code that's high performance on Sierra, but that we know is going to be portable to future platforms. We're insulating all the application developers, the computational physicists, from all this kind of underlying churn in programming models, you know, where you've got CUDA and HIP and SYCL.
That's really not something that you want the application developers to have to deal with. And one of the things that we've really focused on is making it easy to do the most common use cases: incrementally adopting this, one loop at a time as you gradually port your application, is easy, and that was a critical feature in moving applications onto Sierra through some of our APIs.
I know that historically people have viewed the RAJA project as fairly Livermore-focused, but we really are making an effort to collaborate, to continue to develop in the open, and to onboard more external users. So I'll just close by saying that we really welcome users and collaborations, and we hope to hear from some of you. Thanks very much.
Dossay Oryspayev: Thank you, David. I'd like to thank all of our panelists, on behalf of everyone, for presenting these very nice overviews of the different programming models, and for being able to join us during these uneasy times. With that, I'd like to start taking questions from the attendees; and panelists, please also feel free to discuss topics with the other panelists.
Do we have any questions? Okay, it looks like we have our first question. It is coming from one of our organizers, Muaaz Awan. He is asking: in this soup of languages, where each claims ease of use and portability, someone who is new to GPU porting might get overwhelmed.
David Beckingsale: I would say that all of us are probably going to suggest the projects that we're affiliated with. But I think you've kind of seen, across these projects, that the important thing is that you pick something that is going to be portable. And then the second thing to consider would be what fits best with the application that you want to port, because if it's Fortran, then Kokkos and RAJA are kind of ruled out.
Sunita Chandrasekaran: Sunita here, chiming in; plus one to what David just said. Coming from the standpoint of, you know, literally teaching students in class about GPU porting: my other question to the person who is posing this question would be, what is the person's background? Are we talking about somebody who is a beginner at GPU porting, or somebody who's at an advanced stage, or an intermediate stage? Where you want to start will also depend on that.
I would rather throw a directive at them than CUDA, for example, and I can see them starting to use GPUs at the end of the three months of a semester. And they did projects: we did very basic OpenMP offloading, and they did projects with OpenACC on GPUs, and they were happy to have used GPUs. So I think it would also depend on what kind of background the person is coming from, in order to choose a particular framework to begin GPU porting with.
Rahul Gayatri: Hi, this is Rahul. I agree with what Sunita and David said. The first thing is: what is the base language that your code is in? And the second thing is: how much time do you have to spend on this? Kokkos and RAJA are both a bit more intensive, in the sense that you will have to spend a slightly longer time
just to get your code ready to start using these frameworks, compared to something like OpenACC or OpenMP, which are a bit easier initially. But then again, Kokkos and RAJA support more backends than OpenACC or OpenMP. So that's the choice: how much time do you have to work on this?
Dossay Oryspayev: Okay, yeah, thank you for answering this question. We have several questions in the queue, so let's get started with the easy ones. Vincent says that he is sorry he missed some of the last talk: what hardware backends can RAJA target? This is, I believe, for David.
David Beckingsale: Yeah, I got it, okay. So we have sequential, which is, you know, just your standard loops; we have a way to kind of force the compiler to generate SIMD code; then we have OpenMP on the CPU, and on the target with OpenMP 4.5 offload. We have a Threading Building Blocks backend that has partial support; we have full support for CUDA and for HIP; and we have a development backend for SYCL, targeting the Intel GPUs.
Dossay Oryspayev: Okay, thank you. We have several other questions lined up, so the quickest one to answer would be: what is the main difference between OpenACC and OpenMP? An anonymous attendee is asking.
Sunita Chandrasekaran: So, the OpenACC and OpenMP difference. That's an excellent question.
Rahul Gayatri: Okay, apart from the obvious fact that, until a couple of years back, OpenMP was concentrated more on the CPU side and OpenACC more on the GPU side: now, with the rapid development of these target directives by different compiler vendors for different hardware, I would say that, at least for NVIDIA GPUs, OpenMP and OpenACC are pretty much similar. But OpenACC, especially the PGI implementation,
has been around for a longer time, so in some cases you might find that it is more optimized in its implementation compared to OpenMP, because OpenMP offload is still a bit new in that respect. But from my experience, both of them can be used to achieve the same sort of performance.
Sunita Chandrasekaran: Thank you, Rahul; I thought I should let you say it first. Obviously OpenMP has been around for a much longer time than OpenACC, and OpenMP has been, you know, prevalent on CPUs for many years, so there are concepts like tasks in OpenMP where you could probably do the same with OpenACC, but it's a little bit more convoluted.
So if you have an application you want to break down with respect to tasks, I would use OpenMP. With respect to GPU programming: OpenMP offloading compilers are evolving as we speak, we know they are a priority for many different reasons, and there are codes beginning to exist; when I say codes, I mean more than benchmarks, I mean real codes. That's in comparison to OpenACC, which has been around since 2011-2012 onwards,
when implementations began to exist predominantly targeting GPUs, and so the adoption and usability of OpenACC features and implementations for GPUs are more readily available, for large codes and for production codes as well. OpenMP is playing catch-up, and I'm pretty sure they'll get there. But if you want to move your tens of thousands of lines of code to the GPU, OpenACC has been there, done that. So that's my two cents.
Dossay Oryspayev: Okay, thank you. The next question that we'll take is: how do Kokkos and RAJA compare and differ? They seem very similar.
David Beckingsale: I think some of the main differences are, first, kind of philosophical. Like I was getting at in my slides, we tried to make the simple thing simple: it's easy to put RAJA on just one loop, and you don't have to change any of your data structures or your memory-management code. We don't want to take control of that. And I think the other place where there are some differences is in some of the specific features.
So I think the stuff we have in the kernel API, in terms of what you can do with nested loop patterns and how you can map them specifically to various parts of the underlying programming-model backend, is kind of distinct to RAJA. And potentially, following on from that, the way that we express execution policies: it's not in terms of "this is a parallel loop"; it's "I want you to map this loop specifically to, you know, threads on the GPU, or blocks on the GPU", or something like that. So we really expose that control to the user through the interface.
Dossay Oryspayev: Okay, because of the time, let's continue with the next question, which is along similar lines. Jack is asking a question similar to ones that came up yesterday: do David and Rahul see a path towards getting the C++ standard to adopt some of the ideas in Kokkos and RAJA?
Rahul Gayatri: I would say yes. A lot of the advanced features that are available with Kokkos are actually being adopted into the C++23 standard. Sometimes you can imagine that Kokkos was a sort of testing bed for these features, and the things that actually work out well, the features that would actually be beneficial to the language standard as such, are being actively debated within the language committees to get them adopted into the C++ standards.
David Beckingsale: Yeah, and I would just add to that real quick: we have an ongoing collaboration between the RAJA and Kokkos teams to try to come up with some of these features. So the thing we're working on right now is portable atomics that would work across, you know, all these different hardware backends, but have the kind of semantics in the API of what will be in the C++ standard.
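(For reference: a minimal sketch of what a portable atomic looks like in RAJA today, the policy-tagged atomicAdd; the standard-aligned API David mentions was still in design at the time, and the array names here are illustrative.)

    // Portable atomic add inside a RAJA loop: the atomic policy is
    // resolved to the right hardware instruction for each backend.
    RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, n),
      [=](int i) {
        RAJA::atomicAdd<RAJA::auto_atomic>(&hist[bin[i]], 1.0);
      });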
Rahul Gayatri: As far as I know, with Kokkos as of now... so, when you say asynchronous launching, do you mean launching a block of code and then just going ahead and doing other work, or what is it here?
Audience member: I mean, I think Kokkos kernel launches are independent, but on the other hand we have to manage the dependencies in some way, like events, or synchronized streams, or all those things. Is there a...
Rahul Gayatri: Yes, that was for a different purpose. You might be able to do this, but it's not exactly as straightforward as, I think, you are imagining with respect to this question.
David Beckingsale: Yeah, so we have some stuff in development right now, actually. It's taken us a while to figure out the API for this, but basically, yeah, you'll have some kind of portable object that represents your, you know, CUDA stream, for example, and we're currently figuring out what the kernels will return; it will probably be some kind of handle to an event, but it's generic. So it basically gives you what you say: you still have events, but it's not a CUDA event anymore.
Audience member: Does your effort coordinate CPU tasks as well? I mean, have a coherent environment handling both host- and device-side asynchrony, not just purely relying on the CUDA runtime to do the tasking on the device side? Because you not only have the GPU, you also have the host-side activities, as well as asynchronous I/O or asynchronous communication, right?
David Beckingsale: Yeah, I mean, one of the motivating use cases for what we're developing is your kind of communication loop, where you've got MPI stuff going on and you want to be dispatching messages as kernels are finishing. So it's not something that we have working right now, but it's certainly a use case that's driving this development.
Sunita Chandrasekaran: With OpenACC, we do offer async launching of kernels, and there are underlying ACC runtime APIs to be able to, you know, manage things under the hood; there are different types of runtime APIs that we have used.
I see Matt is on Zoom. Sorry to put you on the spot, Matt, but feel free, Mat Colgrove, to chime in if you want to add more to this.
Mat Colgrove: Sorry, I was only half paying attention, so I apologize; I didn't know I was going to be put on the spot. OpenACC does have the ability to do asynchronous execution: the compute kernels can be launched asynchronously to the host, or the data movement can be launched asynchronously, so the host continues as the data movement progresses. So it's all fairly...
you do have to add an extra clause to your directives, and by default it will block, but otherwise it's inherent in the programming model; the API allows for that.
Yes, some implementations will use a stream, if you're doing CUDA, to handle the dependencies, but it's really an async number, and you can apply dependencies via... you can actually do an async wait, where you do a wait, or rather an "acc wait", on one async queue from another, and you can create whole different dependency graphs based on that. So you could have one compute region which waits on async queues one and two, but is then launched on queue three.
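(For reference: a minimal sketch of the async/wait pattern being described, two independent regions on queues 1 and 2 and a third that waits on both; the loop bodies and the functions f and g are illustrative.)

    // Each async(n) clause places work on a numbered queue; the
    // wait(...) clause expresses dependencies between queues.
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = f(i);

    #pragma acc parallel loop async(2)
    for (int i = 0; i < n; ++i) b[i] = g(i);

    // Launched on queue 3, but only after queues 1 and 2 finish.
    #pragma acc parallel loop async(3) wait(1, 2)
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];

    #pragma acc wait  // block the host until all queues drain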
Dossay Oryspayev: Thanks, that's nice! So let's continue. We have about eight minutes until the end of our session, and lots of good questions; some of the questions are open-ended, and I'd like to remind the audience that we have breakout rooms, and hopefully our panelists and the other attendees will be able to join the breakout rooms for further discussions.
But before we end this session, I'd like to ask a final question to all of our panelists; please feel free to take turns answering each of its parts, so that you give all the other panelists a chance to answer. My question is: what are the lessons that you, as a panelist, as an expert advocating a certain programming model, have learned from the history of the race of shared-memory programming models?
That's the first part. To give you a hint: there is a learning curve; there are academia and the community at large that take up adoption, and not least the support that comes with it; and finally there are the domain scientists, who also need to adopt those programming models. For example, what have you learned from the feedback you obtained via downloads, compiler inclusions, publications?
Sunita Chandrasekaran: I can chime in. It's a very good question; thank you for asking it. Having worked with several domain scientists belonging to many different domains, one thing we have learned, as part of my group here at UD, is profiling. Profiling helps immensely: things that you thought you had optimized and moved over beautifully...
this is with respect to OpenACC directives. So profiling and re-profiling, going back and fixing the optimizations. And there are tools like PCAST, which I was mentioning on the call yesterday, which allows you to look into the accuracy, or verification, of the ports between CPU and GPU. That has also helped us with, you know, important domain-science codes where accuracy matters. What else... did that answer your question?
Sunita Chandrasekaran: So, yeah, we do talk about this quite often within the committee, and I think I was trying to answer this on the chat channel: I believe directives are along for the ride. Eventually, and this is my personal opinion, I would think that things will move towards the base languages.
There are things directives cannot do, which is why programming models like CUDA, for example, or, you know, even the base languages, are doing their fair share. So there are things directives cannot do, and instead of trying to fix that, I think it would be ideal to get the best of all these different worlds and put them together, and that would probably be the base language going forward; probably directives won't be there ten years from now. Who knows.
Rahul Gayatri: Could you just repeat the question once more?
Dossay Oryspayev: Okay, so the general question was: what are the lessons learned from the history of the race of shared-memory programming models? Most probably, before advocating something, you have looked back into the history and have seen this race of shared-memory programming models; there were things introduced then, as is happening now. What are the lessons that you have learned, and what were the things driving you to advocate the specific programming models that you are currently trying to push forward?
Rahul Gayatri: So, I used OpenMP quite extensively before I started using Kokkos, and I agree with Sunita in saying that these directive-based programming models have a limit on what they allow you to expose. That's one of the reasons why I like Kokkos: it allows me to express the underlying parallelism in a much finer and better way, along with allowing me to use its
Views, you know, the multi-dimensional data storage classes, which provide layout options and things like that. With OpenMP you would actually do this kind of thing manually, and then you have to change it every time you go from CPUs to GPUs or vice versa, whereas a somewhat more involved framework gets you closer. I'm assuming, I've never worked with RAJA, but I'm assuming
it also has the same thing, where it allows you this more fine-grained control and exposing of parallelism, more than what a directive-based programming model can actually allow you. But there is a catch, right: it's not as easy to move to Kokkos or RAJA as it is with OpenMP or OpenACC, which is their strong suit.
It's easier to start programming on a GPU with OpenACC; as Nathan mentioned on a previous question, if it's a three-month project that you want to do, then OpenACC or OpenMP is probably the way to go. But you might not be able to expose everything, or get performance as optimal as you can with the more involved programming models. Does that answer your question?
Dossay Oryspayev: I'll just say thank you. So for the next one, David, let's have your answer; we're reaching the end of our time.
David Beckingsale: Yeah, sure. That was a super kind of in-depth question. Digging back into my memory bank: my first experience with this kind of stuff was really programming OpenCL. It's portable, that's great, but the one thing that really sticks in my mind is all the boilerplate. Then I kind of moved to CUDA, and yeah, you get less boilerplate, but you can only run your code on one vendor's hardware.
So really, I think it's clear that some kind of portable approach, whether or not that's, you know, some C++-based thing like Kokkos or RAJA, or a directive-based approach, is really going to benefit you in terms of the lifetime of your application. And for a more kind of practical recommendation: one thing that we've seen to be incredibly useful, with something like RAJA or Kokkos,
is that you can write your code and then, depending on the backend you select, you can target the GPU or the CPU. So what a lot of our code teams have done is, if they run into a bug on the GPU, they'll take exactly the same code but have it run with the OpenMP threaded backend on the CPU, and that lets you leverage all of the kind of debugging
and correctness tools that you have on the CPU to track down bugs; then you can go ahead and rebuild your code for the GPU. And oftentimes, you know, the tools you have access to on the CPU are better for addressing these kinds of issues. So I think, with any technology you choose, being able to run exactly the same code in multiple locations can really help in terms of debugging any problems that you run into.
Dossay Oryspayev: Okay, great. Thank you, David, Rahul and Sunita, one more time. I think this discussion went well; I mean, all of our audience were very active during these discussions.