From YouTube: Performance and Portability
Description
Christopher Daley (LBNL), John Owens (UC Davis) and Phil Roth (Oak Ridge National Lab) present a Panel Discussion on Performance and Portability. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/
Due to some data loss, this recording is missing the start of the initial discussion. Panel Chair: Muaaz Awan.
A
If you want a big multi-GPU box, they'd like you to throw down six figures and buy a DGX, but they've gone to a huge amount of effort to try to give you good scalability within that. So if you're interested in the highest performance generally, what you want to look at today are those sorts of boxes, and it seems like the supercomputers are moving in that direction too.
A
Multiple GPUs in the same box, but with a large investment toward trying to make them look like one GPU as much as they can. So, in terms of new hardware capabilities: there are a lot of things you can point to that have been introduced to processors over time along the lines of "here's this general thing we'd like to do in software, and we're adding hardware support for it," or "here's…"
A
Only in the last few years has Python really started stepping up in the GPU world. NVIDIA has their RAPIDS initiative toward doing data science and trying to do everything on the GPU, and there's a nice NumPy implementation that's been done by NVIDIA Research, published at Supercomputing last year, that gives good NumPy speedups on GPUs. So this is also a trend that I think is worth catching on to, because Python is a very nice environment to develop in.
A
And finally, the compilers: LLVM is this open-source compiler that's relatively straightforward to work with, and there are some new just-in-time kinds of technologies for being able to generate code at runtime, and they give you a lot of opportunity for doing some interesting innovations on the software side.
A
Three things I want you not to do as you're developing GPU codes. The first one is: do not think of the CPU as the main processor anymore. You want to think about this reverse-accelerator kind of model, because if your code is bouncing back and forth from CPU to GPU to CPU to GPU, you're having to move data a lot, and you really, really don't want to move data: that is expensive, and the bandwidth between CPU and GPU is rather modest.
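A minimal HIP sketch of that idea (hypothetical kernels, not from the talk; assumes a HIP toolchain): the two pipeline stages chain on the device, with one small copy back at the end instead of a host round trip between every step.

```cpp
#include <hip/hip_runtime.h>

// Two hypothetical pipeline stages; the point is that they chain on the device.
__global__ void step1(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * x[i];
}

__global__ void step2(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* d_x;
    hipMalloc(&d_x, n * sizeof(float));
    hipMemset(d_x, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    step1<<<grid, block>>>(d_x, n);   // no host round trip between stages
    step2<<<grid, block>>>(d_x, n);

    float first = 0.0f;               // one small device-to-host copy at the end
    hipMemcpy(&first, d_x, sizeof(float), hipMemcpyDeviceToHost);
    hipFree(d_x);
    return first == 1.0f ? 0 : 1;
}
```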
A
And as much as you can, limit the amount of work per parallel grain. If you balance those things well, then you have what we call high occupancy: you're keeping the GPU busy all the time. It's necessary to keep the GPU busy, because it is absolutely critical that the GPU be able to hide long latencies from the memory system by having lots of work available to do. This is something where profilers can help, but experience helps as well.
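A minimal sketch of that balance (an assumed AXPY kernel, not from the talk): a grid-stride loop gives each thread a modest grain of work, while the launch still puts far more threads in flight than the GPU has lanes, so the scheduler always has warps ready to run while others wait on memory.

```cpp
#include <hip/hip_runtime.h>

// Grid-stride AXPY: each thread handles several elements, but the grid is
// large enough that stalled warps can always be swapped for ready ones.
__global__ void axpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    hipMalloc(&x, n * sizeof(float));
    hipMalloc(&y, n * sizeof(float));
    hipMemset(x, 0, n * sizeof(float));
    hipMemset(y, 0, n * sizeof(float));
    axpy<<<1024, 256>>>(2.0f, x, y, n);   // ~262k resident threads in flight
    hipDeviceSynchronize();
    hipFree(x);
    hipFree(y);
    return 0;
}
```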
B
Thank you, that was very insightful. Next we have Phil from Oak Ridge National Lab. Can you please go ahead and share your screen?
C
Yes, I'm having to let Zoom get permissions to share.
C
All right, sorry about that; you can take that couple of minutes off my account. So, my name's Phil Roth. I have been a member of the Oak Ridge Leadership Computing Facility since late 2018. Before that, I was a member of the Future Technologies Group at Oak Ridge National Lab, where we focused on things like emerging technologies, and GPUs were one of the things our group worked on.
C
The biggest aspect of my current position involves preparing codes, or getting codes ready to run well, on Frontier, which, as you might expect, is our facility's next machine.
C
There's lots of attention being paid right now to trying to make something that will work well there, but we have a system on the floor right now, Summit. It's not the number one machine in the world anymore; I'm told it's number two as of last week. But it's still a very good development platform for Frontier, and, as you may know, the two systems use different vendors for their GPUs and have slightly different node architectures. So, about that node diagram that I have up on the right:
C
That
is
an
updated
frontier,
node
diagram.
Earlier
this
year,
cray
now
hpe
has
given
us
the
permissions
to
give
a
little
more
detail
about
what
the
node
organization
is
going
to
look
like.
So,
if
you're
interested
in
what
the
frontier
nodes
are
going
to
look
like,
I
believe
they
also
talked
a
little
bit
about
what
the
nvidia-based
shasta
nodes
are
going
to.
Look
like
too.
C
I'm also very interested in portability, as you might imagine, because of those couple of systems, and also because the point was made earlier in today's talks that most people in the user base of one DOE center are not necessarily limited to that DOE center; they're trying to run on systems at different centers, both within the DOE complex and outside. So one of my challenges in working with applications is to try to open people's minds to doing things in a way that's going to potentially port well to other systems. That involves looking at things that lots of people call "programming models."
C
I have a problem with that name; that's why I put it in quotes. But one of our primary ones at Oak Ridge to be looking at for Frontier is HIP, which Renee mentioned several times this morning.
C
The chart that you see to the right is a little bit of an eye chart, but what it shows is some normalized performance data from a benchmark suite, comparing HIP versions of the benchmarks running on Summit to the CUDA versions of the same benchmarks running on Summit. The HIP performance was measured to be 99.8 percent of the CUDA performance. But for both my own application development and for professional interest, I'm also interested in the SYCL and lambda-based models, where you specify your code in lambdas: Kokkos, RAJA, DPC++, and SYCL.
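For readers who haven't seen one, a minimal SYCL sketch of that style (illustrative only; assumes a SYCL 2020 implementation such as DPC++ or hipSYCL): the kernel body is an ordinary C++ lambda, and the runtime moves the data.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> v(1024, 1.0f);
    sycl::queue q;  // selects a default device (a GPU if one is available)
    {
        sycl::buffer<float> buf(v.data(), sycl::range<1>{v.size()});
        q.submit([&](sycl::handler& h) {
            sycl::accessor a(buf, h, sycl::read_write);
            // the kernel is just a lambda over an index space
            h.parallel_for(sycl::range<1>{v.size()},
                           [=](sycl::id<1> i) { a[i] *= 2.0f; });
        });
    }   // buffer goes out of scope: results are copied back into v
    return v[0] == 2.0f ? 0 : 1;
}
```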
C
So
I
loved
the
the
presentation
that
we
just
got
a
couple
of
the
recommendations
there
hit
really
close
to
home.
For
me,
what
I
want
to
bring
out
as
recommendations
is
is
to
think
about
your
code.
This
is
to
all
the
application
developers
think
about
your
code
kind
of
at
a
higher
level
and
a
little
bit
independent
of
the
programming
model
that
you
think
you're
going
to
use
to
actually
express
your
code.
C
If
you
like
to
think
in
python,
you
know
think
about
list
comprehensions.
If
you
are,
you
know
a
c
plus
plus
user.
Maybe
you
have
to
think
about
coco's
raja
if
you
have
the
experience,
but
you
might
have
to
get
acquainted
with
things
like
the
algorithms
library
and
then
choose
preferably
something
that's
portable
as
a
programming
model,
and
I
want
to
re
reiterate
the
point
that
was
made
kind
of
more
subtly
earlier
today
about
becoming
conversant
with
the
tools
that
the
systems
have.
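A small sketch of that "think in algorithms" advice (illustrative only): express the operation as a standard primitive first, and the same shape later maps onto Kokkos, RAJA, SYCL, or a parallel execution policy with minimal rewriting.

```cpp
#include <algorithm>
#include <vector>

int main() {
    std::vector<float> in(1000, 1.0f), out(1000);
    // A transform, not a hand-rolled loop: the algorithmic intent is
    // explicit, so the body can later become a device lambda in a
    // portable programming model.
    std::transform(in.begin(), in.end(), out.begin(),
                   [](float v) { return 2.0f * v + 1.0f; });
    return out[0] == 3.0f ? 0 : 1;
}
```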
C
So
all
the
doe
centers
have
training
all
the
gpu
vendors
have
have
training
materials
available
about
using
tools,
sometimes
the
vendors
themselves.
Their
tools
are
are
more
focused
on
a
particular
area
like,
for
example,
it
might
be
the
single
node
performance
and
then
you
have
to
go.
Look
for
a
a
separate
tool.
That's
going
to
give
you
a
scale
out
picture.
C
There's
some
examples
here
on
the
slide
and
my
experience,
especially
with
those
that
are
more
academic
scale
out
tools,
don't
be
shy
about
giving
them
feedback,
sometimes
they're
frustrated
that
they,
you
know
the
projects
are,
are
trying
to
do
the
best
they
can
in
terms
of
getting
information.
C
Sometimes
they
don't
have
all
the
information,
but
they
certainly
would
love
to
hear
about
people
who
are
trying
to
use
their
codes,
but
are
trying
to
use
their
tools
but
are
running
into
barriers
because
of
it,
and
that's
what
I
have
I
think
now
we
trans
transfer
into
the
question
and
answer
period
right.
B
Yeah, thank you, Phil. So now it's the start of the question and answer session, or the discussion session. The audience can ask questions in the Q&A box, or, if you want to ask a question verbally, you can raise your hand and we'll unmute your mic and allow you to talk. And if the panel members have questions for each other, that would also be appreciated.
C
Yeah, so what can I say about that? As Jeff Hammond talked about, there are some open-source implementations, and there are discussions going on; that's the best I think I can say in this venue about what might be possible. One of the things that I am interested in and play around with, as professional or personal research interest, with an eye towards what we could do, is trying to see whether the open-source implementations of these things can be made to run on systems that they weren't necessarily the primary target for. So, for example, Jeff Hammond mentioned hipSYCL; I think it was mentioned in this question.
C
I guess it was last year that I did some work to get hipSYCL running on, or at least to demonstrate hipSYCL running on, Summit, which was interesting to be able to do. That's not DPC++, so if you're reliant on the DPC++ extensions it's not going to give you those, but it gives you the basic SYCL-type operations.
B
Okay,
so
there
are
like
a
couple
of
questions.
Maybe
three
questions
in
the
chat
portion.
Can
all
the
panelists
have
a
look
at
those
starting
from
the
first?
That
seems
interesting.
A
Yeah, this is John. The programming models I know for GPUs are primarily data-parallel based, as opposed to task based, and so while I think there's interest in doing task programming, the programming systems don't support that nearly as well as they would if you did it in a data-parallel style. So I believe that in theory what you say is a good thing, and there are runtimes that are supporting it to some extent, but they are not supported by the vendors as much as the traditional way of doing things.
B
Okay, that's interesting. Does anyone else want to address the same question?
F
Yeah, sure, this is Chris. So this is supported to some extent by OpenMP runtimes: you have the map clause in order to perform the data transfers, and then you can also specify dependencies between OpenMP target regions, and these OpenMP target regions are effectively tasks.
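A minimal sketch of what Chris describes (assumes an OpenMP 4.5+ compiler with offload support): two target regions with map clauses, ordered through depend clauses so the runtime treats them as dependent tasks.

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20];

    // First target task: fill a on the device, then map it back.
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:n]) depend(out: a) nowait
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    // Second target task: depends on a, so it runs after the first completes.
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n]) map(from: b[0:n]) depend(in: a) nowait
    for (int i = 0; i < n; ++i) b[i] = 2.0f * a[i];

    #pragma omp taskwait   // wait for both deferred target tasks
    std::printf("b[0] = %f\n", b[0]);
    return 0;
}
```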
B
All right, Phil, do you want to say something about this?
C
Yeah, sorry, I was muted. I was just going to say that that's one of the arguments that the DPC++ and SYCL folks make for their programming model: the runtime is handling the data transfer; you're describing the dependencies, and the runtime is handling that. You can argue whether my experience is broad enough or not, but in my experience there are different classes of people's perceptions on that.
C
Some people would much rather be in control of all the data movement, and other people are happy to let a runtime try to automate that. And it's almost like a pendulum: it feels like it swings back and forth, where people think that a runtime can be smart enough, and then they see that maybe it can't, and then people go back to a more explicit data-movement type of approach.
B
All right, we have another question in the Q&A box: "What version of HIP was used for the HIP versus CUDA comparison? I have noticed much worse performance than with CUDA with version 3.5, especially with kernel launch overhead and impact on CPU performance." I think this is more targeted towards Phil.
C
Yeah, and I was typing up a response, which was part of the reason I was muted. One of the things that maybe didn't come out in the brief presentation is that both of those versions of the code were running on the same system, because of the way that HIP works. These benchmarks, apart from some very limited use of the HIP libraries, aren't using anything other than straight-up HIP.
C
The kernels were actually implemented in HIP or implemented in CUDA, as opposed to making calls to external libraries. Essentially, the way that works on a platform with NVIDIA GPUs, which is the way it's going to work on Perlmutter, is that there's a translation that goes on at compile time, and you end up compiling your HIP code with the nvcc compiler; at least, that's the way it was supported at the time that I did the comparison experiments. So by the time you've actually produced an executable, as far as the system and the runtime know, it's a CUDA executable. So I didn't see the problem that Charles is bringing up.
B
Okay, does anyone else on the panel have a response to that, or should we move on?
G
Oh, I actually don't have a question, I just have a comment. I think even in CPU land nowadays we have SIMD, which is a kind of low-level data-parallel paradigm, and then at a high level we have the cores. Naturally, this is quite similar to a GPU, where you have one GPU or multiple GPUs, and, on the other hand, you have a fat low-level vector. So when we program either kind of system, we have to think about how to utilize all the low-level data parallelism, and then, at the high level, we should also consider task-parallel types of things. The reason is that as these devices get fatter, they start to get isolated, and the interconnect…
G
You do have to consider both in your program. If your program can first leverage the data parallelism, if your loop has a million iterations, do it, and that will probably work pretty well for GPUs for a while. But as vendors scale up the chips, they get even more powerful, more hungry, and you want to strong-scale your problem to reduce the time to solution, to the point that a single kernel cannot feed the device. Then you do have to take advantage of asynchronously launching multiple kernels to get them to occupy the device.
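A minimal HIP sketch of that last point (hypothetical kernel, not from the talk): small kernels launched into separate streams may overlap, so together they can fill a device that no one of them could occupy alone.

```cpp
#include <hip/hip_runtime.h>

__global__ void work(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;          // deliberately small per-kernel work
    const int nstreams = 4;
    hipStream_t s[nstreams];
    float* d[nstreams];
    for (int k = 0; k < nstreams; ++k) {
        hipStreamCreate(&s[k]);
        hipMalloc(&d[k], n * sizeof(float));
        hipMemsetAsync(d[k], 0, n * sizeof(float), s[k]);
        // kernels in different streams are independent and may run concurrently
        work<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n);
    }
    hipDeviceSynchronize();
    for (int k = 0; k < nstreams; ++k) {
        hipFree(d[k]);
        hipStreamDestroy(s[k]);
    }
    return 0;
}
```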
B
Yeah, thank you for the comment. I think it's very interesting.
B
All
right,
we
have
a
few
more
questions
in
a
q,
a
box.
So
the
first
question
is:
how
does
number
compare
to
numpy
on
gpu?
I'm
guessing
john
owens
pointed
out
some
python
accelerated
libraries.
Do
you
have
a
comment
on
this.
A
Right, so I can point you to two things that I know, and I hope other people might comment and say, "oh, look at this," otherwise. The NumPy stuff I know is built on top of Legion; it's Legate ("leg-ate"), and it was published at Supercomputing last year. I'm not aware of it being a product, but the results they got were outstanding; it's really neat work. For the Numba stuff, I would look at RAPIDS, at rapids.ai. I would say the Python development that's been happening at NVIDIA has been focused within RAPIDS, and it's more on the data science side, but they've done a lot of Python work to be able to make GPU computing map onto that. Numba specifically is certainly mentioned on their web pages.
B
Thank
you,
and
next
we
have,
I
think,
it's
more
of
a
comment.
Handling
data
transfers
by
hand
also
means
handling
the
gpu
memory
by
hand.
This
can
become
hearing.
C
Yeah, so I'm guessing this was in response to my comment, and again, I really feel like people fall into different categories. You can argue that a person shouldn't need to pay attention to that level of detail, but I can point to examples of scientists where you can tell them, hey, let the runtime, let the compiler, let the whatever handle this, and they don't want to.
C
They don't trust that the toolchain can actually handle it. At some point you can sit there and try to show them, hey, the performance is close enough, but some people are dead set on not trusting that those things can stay abreast of the architecture changes or the underlying software changes. They've got access to the low-level pieces, and they want to manage that complexity themselves.
B
Okay, that makes sense. Next we have an interesting question from Kate Clark: "What do the panel think of C++17 parallelism and Fortran 2018 do concurrent, and evolutions of those going forward (e.g. Kokkos feeding into C++23, etc.)? Is first-class language evolution going to replace directives like OpenMP and OpenACC, or language extensions like CUDA and SYCL, perhaps saving extensions for specific optimizations as needed?" I believe the panelists can see the question.
A
It could, and I'm looking forward to your talk; I think it absolutely could replace those. But I think the question is how quickly that happens, right? One of the reasons that CUDA has gained traction is that NVIDIA has been able to make changes to the language fairly quickly and bring in new hardware features, and to do that in a standards…
C
I
guess,
as
I
took
it
from
what
I
read
and
what
I
remember
was
that
directives
were
essentially
a
stop
gap
in
between
getting
something
eventually
accepted
into
the
the
core
language
of
whatever
it
was
that
whatever
the
programming
language
was,
so
I'm
actually,
I
you
know
what
I've
seen
of
the
c
plus
plus
17
parallelism,
I
think
I'll
admit
I
think
more
in
in
c
plus,
plus
than
I
do
any
other
language,
but
so
what
I've
seen
there
has
been
really
exciting.
C
As someone who's a fan of things like Kokkos or SYCL or DPC++, I think we need to see evidence that the implementations are going to do well, but I trust that eventually they will; I'm not worried about that.
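For concreteness, a minimal sketch of the C++17 parallelism under discussion (standard algorithms with execution policies; some compilers, for example NVIDIA's nvc++, can offload these to GPUs):

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> x(1 << 20);
    std::iota(x.begin(), x.end(), 0.0);

    // The execution policy asks the implementation to parallelize (and
    // possibly vectorize) the algorithm; no directives or extensions needed.
    std::transform(std::execution::par_unseq, x.begin(), x.end(), x.begin(),
                   [](double v) { return v * v; });
    double sum = std::reduce(std::execution::par_unseq, x.begin(), x.end());
    return sum > 0.0 ? 0 : 1;
}
```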
F
Yeah, these base-language additions are important, but just looking at Fortran, for example, the fact that it only provides a do concurrent abstraction is really not enough for users to implement all of the parallel algorithms they want to implement. It will take time before the base language can support all of the parallel constructs that we need, and, as was mentioned earlier, in order to use the latest and greatest low-level hardware features, these proprietary languages or directives are the…
B
Okay, so we have Laurie Stephey, who would like to respond to one of the questions. I'm going to enable your mic; Laurie, you're good to go.
D
Okay, yeah, can you hear me? Yep? Okay, I just saw one question in the chat that the speaker already addressed, about NumPy versus Numba. Yeah, the NumPy that most Python people are familiar with is not going to work out of the box on a GPU, so you'll need something specific, and that might be CuPy, which looks like NumPy but is CUDA under the hood, or Legate, which is what the speaker mentioned, which is Legion under the hood. I linked those in the Q&A, but yeah.
D
Yes, I have. I've used Numba quite a bit and I like it, but for most Python people it might be a little bit of a shock, because it looks more like CUDA than Python. And actually, we're just now getting a collaboration going with the Legate developers, so we're going to test that out on Cori GPU.
D
So I can report about that soon. And yeah, CuPy is the easiest by far, so it's probably the best thing to get started with. I'm happy to talk more about this.
B
Okay, great. And I think Jeff Hammond also has a few responses to one of the questions. It would be great if Jeff could answer verbally. I've just unmuted your mic; do you want to?
E
Oh, okay. Sorry, I was just saying that for folks who are interested in USM and any other features, my email is easy to find; use either Gmail or Intel (Intel's preferred, but Gmail is easier to find). Just let me know. I'd be really interested to know what features folks want to see, and I can introduce you to the implementers if that would be helpful.
C
All right, I've got a panelist-to-panelists question, well, two panelist-to-panelists questions. The first one is for Owens: you made the comment about data structures today, and I love the comment. I can't tell you how frustrated I am with one of the codes that I've been working with, where we, and by "we" I mean me, because I was involved in the initial construction of this code…
A
Sometimes you want to do things heterogeneously: you want to make the common case fast and try to make that fully coalesced, very good parallel access, even if it means you're doing things on the periphery differently. That is often a win, to make things heterogeneous, and then you have different execution paths, maybe different kernels, to handle those things. But I think we have to start at the beginning when we design something and say, okay, how are we going to lay out the data?
A
How are we going to access the data? What are those patterns? And then design the computation around them, and also understand how we're passing data from kernel to kernel. Now, more and more, the GPU memory hierarchy is allowing better and better caching (the caches on Ampere are much larger than the ones on previous generations), and so, while historically it hasn't been very effective to do caching to do latency reduction, more and more, designing things that use the cache well is going to be a good direction to go. The other thing is that the user-managed cache, the "shared memory" as NVIDIA calls it, that little scratchpad of a few dozen kilobytes, is really crucial: if you can keep your computation within a single kernel, then that is a huge win in terms of performance, and in terms of more sophisticated data structures.
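A minimal HIP sketch of that scratchpad idea (hypothetical three-point blur, not from the talk; assumes 256-thread blocks): each block stages a tile plus halo cells in shared memory, so neighboring elements are fetched from DRAM once and reused.

```cpp
#include <hip/hip_runtime.h>

// Three-point blur that stages a tile of the input in the per-block
// scratchpad ("shared memory"), so each element is fetched from DRAM once.
__global__ void blur(const float* in, float* out, int n) {
    __shared__ float tile[258];          // 256 elements + 2 halo cells
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;
    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[257] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();                      // tile is now complete for the block
    if (i < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    hipMalloc(&in, n * sizeof(float));
    hipMalloc(&out, n * sizeof(float));
    hipMemset(in, 0, n * sizeof(float));
    blur<<<(n + 255) / 256, 256>>>(in, out, n);   // 256-thread blocks
    hipDeviceSynchronize();
    hipFree(in);
    hipFree(out);
    return 0;
}
```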
A
There's this real chicken-and-egg kind of problem, where people don't think in terms of using more sophisticated data structures, so they don't write programs that use them, so nobody has any use cases for which to design more sophisticated data structures. But I think the future is that we will have a larger toolbox of data structures, and I certainly am interested in hearing what those data structures should be from people who might be used to using them on other platforms.
C
Yeah, just to also emphasize that point that was made about thinking of the CPU as being the offload engine and the GPU being the core processor: one of the statistics we like to throw around is that on Summit, well over 90-some percent of the capability comes from the GPUs, and on Frontier it's going to be over 95 percent of the capability of the system.
A
It is less expensive to do a breadth-first search on a large graph than it is to send that graph from CPU to GPU, which is kind of mind-boggling, right? A BFS is not the most complicated thing, but you've got to touch everything in the graph, and it's still faster than sending the graph from CPU to GPU.
B
So there has been a lot of talk about performance and portability, and for some time I've also seen productivity mentioned alongside those two Ps. How do the panelists think productivity will be factored in in the near future? We have models like the roofline model for quantifying performance on different devices, and portability to some extent we can measure, but how do you factor in productivity?
C
Well, all right, I'll jump in there, and I hate that I'm saying something about back in the old days, but DARPA had a program called PERCS (P-E-R-C-S), and one of the things they tried to do was come up with a definition of productivity. My recollection was that there were some interesting attempts, but nobody actually felt like they did a good job of defining productivity.
C
So,
yes,
I
know
about
the
p3hpc
efforts
because
I'm
involved
in
them
on
the
organizing
side,
and
but
I
I
struggle
a
little
bit
even
the
portability
side
right.
So
we
have
a
couple
of
people
that
have
have
come
out
with
metrics,
like
john
pennycook
from
intel
has
defined
a
metric
for
what
performance
portability
might
mean,
simon
mcintosh
smith,
who
had
some
exposure
in
one
of
the
earlier
presentations
from
his
work.
They
they
have.
C
C
The ideal thing would be to have models that would show us whether it's worth doing that porting effort you just talked about, but I don't think those models exist, and right now I haven't seen anything that I think is so promising that we're going to get there soon.
A
I mean, it's a near-impossible problem to solve, to get both performance and portability. If I were to look back at the OpenCL effort, I felt like there was a lot of emphasis on correctness portability, in that you could run something on another piece of hardware and it would run correctly, but there was just not really an emphasis from the committee on how to build a programming system that allows high performance across the whole range. Maybe that's impossible, but it just wasn't a focus, and there's always a tension between portability and performance. I will be looking back in another panel in ten years to see whether the current efforts have struck the right balance there or not.
F
I would just say, for productivity, I think one measure is where you can develop the code. If you consider something like directives, you could develop your application with directives on any platform, even on a machine without a GPU, so being able to develop the code anywhere is obviously a productivity win. Also, with things like directive-based programming, you have a longer-term kind of maintainability win, where the compiler can generate code that's appropriate to the target hardware at that time. It's less down to the programmer to tune their code for a specific piece of hardware; with the directives you can defer control to the compiler and allow the compiler to better map the code to the hardware at that time.
C
I'm
gonna
argue
that
those
nice
features
that
you
mentioned
are
not
limited
to
directives,
they're,
not
specific
to
directives,
because
I
think
about
you
know
something
like
coco
raja.
That's
got
different
back
end,
implementations
and,
and
those
features
you
just
talked
about.
You
know
you
could
say
that
if
you
write
to
that
coco,
raja
interface
and
and
sickle
and
dpc
plus
plus
fall
in
the
same
category
right
there,
they're
acting
as
an
abstraction
layer
and
the
back
end
implementation
is
the
thing
that's
going
to
determine
your
actual
performance.
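For readers unfamiliar with that style, a minimal Kokkos sketch (assumes the Kokkos library; the back end, whether CUDA, HIP, OpenMP, or serial, is chosen when Kokkos is built, not in this source):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // A View lives in the default memory space of the chosen back end.
        Kokkos::View<float*> x("x", n);
        // The same parallel_for source runs on whichever back end was built.
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 2.0f * static_cast<float>(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```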
F
Compared to directives, I was thinking more in terms of proprietary languages. I know, for example, in my early days of trying to use CUDA, it was hard just to get access to a machine that had NVIDIA GPUs for me to experiment with CUDA. There you go.
A
That comes up every year at the NVIDIA data science summit: everybody says, "I want a laptop that has CUDA on it," and everybody has Macs in the room, and so it's frustrating. I agree that, productivity-wise, it's really nice to be able to work on your laptop.
C
Well, that's part of the reason why, again, thinking back to that one code I was talking about where we're struggling with the data structure, one of the things we have adopted there is that we're using Kokkos for that code. So we can be doing development on laptops all the way up to the big HPC machines, with different performance, while recognizing that there are problems you see when you're targeting actual GPU hardware that don't necessarily show up if you're using OpenMP on a CPU as the back end.
A
One of the enormous strengths that I've had from being a CUDA developer, and my group has had, is that there are just so many other CUDA developers, and so we're often able to leverage work other people have done. As an academic, Kokkos is not something that's had, in my view, an enormous impact outside the DOE.
A
I like its focus and what it's trying to do, but how do you enlarge that footprint? Because if you had a big presence outside the DOE, it would only help people inside the DOE.
B
Thank you very much for your points of view. I think the discussion is still ongoing; maybe we can continue it during the break in the breakout room. I would like to thank once again Chris, John, and Phil for participating very actively, with all their energy, in this session, and all the audience who participated.