From YouTube: CNCF Research End User Group: HPC/HTC End User Landscape
A: All right, so we can start, I guess. Welcome everyone. Today's session is the first one on this new platform, so we've got a few people here; maybe we lost some on the way, but hopefully not many.
A: The topic for today is the HPC and HTC end user landscape review. We sent out a form two weeks ago with a couple of questions, and hopefully that will trigger some of the discussion. I see some people don't have microphones available, so hi, Alex.
D: I think we're fine, other people were just muted. Okay, just me then, okay.
A: So the topic today is the HPC/HTC end user landscape, Jamie. We put out some questions, mostly to trigger discussion. Today we can go through the replies, stop on each topic and discuss it a bit, and one thing that would be nice is to come up with some next steps: what is needed in this area from the cloud native tooling, and what would be really useful for people.
D: Yeah, and make sure people put their names on the agenda as well, like normal. Actually, we don't need that anymore, because we'll get it.
A: All right, so maybe we just go through it. Everyone, if you can add your names there, we can start by going through the questionnaire. I think we can stop at each question, and if anyone has anything to highlight, we can discuss it in detail.
A: So the first question was what kind of solutions people are using for high performance computing, high throughput computing and other batch-like workloads. In total we got eight responses, which is not too bad, I would say, because it was kind of a long questionnaire. The top two were Slurm and pure Kubernetes, which I was a bit surprised by, actually. Then we had two for HTCondor, one for Armada, and none for Volcano; I had added it just because it's a native Kubernetes scheduler, or a cloud native scheduler. There was one for Kubeflow, which is interesting, and then an ancient Torque system.
D: One thing that strikes me on this: people are clearly using more than one thing as well. We've only got five and five on Slurm and vanilla Kubernetes, and they're also not unique, so that's kind of interesting.
A: Maybe a question here, for the ones who answered with more than one: is that from a transition to something new, or is it a plan to maintain both in parallel?

D: We're an example of people doing both for a transition.
H: [inaudible] actually, but yeah, I was wondering too: I would be surprised if there's really nobody using Volcano and Armada and other things like that. So maybe it's also kind of a call that we need to put out feelers into those user groups and try to get them more involved with this SIG, because people who are actually using those would certainly have overlap with what we're doing.

H: I wonder if there are ways we could reach out to those groups, or even Kubeflow too. I don't know too much about Armada and Volcano, but I feel like those are non-zero user communities right now. Torque, I'm not really interested in reaching out to that community... I'm just kidding.
A: All right, yeah, I think that's a good point actually. For Volcano, they have quite a good structure of weekly meetings. I think they mostly have end users in Asia for now, and those are weekly meetings, and then every two weeks they have a meeting that is Europe and North America friendly.
E: Another note on the Volcano thing: depending on the group, they may not be able to access Google Docs or submit to Google Forms, so that might segment off another part of the community there.

E: Yes, right, but if you're creating a form for a global audience, SurveyMonkey will work in China.
A: Okay, that's good to know. In terms of reaching out, maybe that's a good point. We can take it as an action just to advertise this group in those communities and see if they are interested in joining. It's not like we cover this kind of topic every time either; there's a lot of things they will probably not be so interested in.

A: All right, the other thing I noticed: Slurm has a pretty strong presence here.
E: I can speak for my old job. They were using a mix of Slurm and Kubernetes. There wasn't any real transition plan to go from one to the other, but that was the University of Michigan, and a good chunk of their users were very familiar with Slurm; they didn't really want to interrupt their workflow.
H: Yeah, ORNL is obviously heavily using Slurm and LSF; I put them all kind of together, Torque, Slurm, LSF. Torque's kind of on the outs these days, I guess, but PBS, all of those. A lot of times, certainly for us, with the vendor we buy the supercomputer from, at the scale they're building, it's going to come with LSF. We bought an IBM machine.

H: It came with LSF, so I don't think those sorts of things are going away for the traditional HPC community, kind of like what Bob's talking about. For us, the name of the game was how do we bridge that gap as much as possible: how can people use the Slurm commands, sbatch, from inside of a container, that sort of thing. That's the direction we've gone, but I don't see those getting supplanted.
E: For what it's worth, I do see more people looking to transition, or at least to support running both, largely just because it's honestly a lot easier these days to get up and going in Kubernetes and to potentially burst out to some place. At my old job, a lot of the researchers were more interested in using things like Kubeflow, and it just made it a lot easier to get going there.
C: Sorry, go ahead... no, go ahead. I was going to say, one question there would be: is there anything prohibiting people from moving towards vanilla Kubernetes as the scheduler of choice, or is it mostly just familiarity with the old stuff, so let's continue using what isn't broken?
A: Yeah, I think the answer there is that there are things missing. At least for us, the things that are missing are priority queueing, the notion of a queue on top of just the workloads in Kubernetes, and then the notion of fair share to optimize cluster usage.
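For readers following along: vanilla Kubernetes does give you pod priority and preemption out of the box through a PriorityClass, but that is where it stops; there is no queue object or fair-share accounting on top of it, which is exactly the gap described above. A minimal sketch using the official `kubernetes` Python client (the class and job names below are made up for illustration):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

# A PriorityClass only orders pods and drives preemption; it does not queue
# work or enforce fair share between teams, which is the gap discussed above.
priority = client.V1PriorityClass(
    api_version="scheduling.k8s.io/v1",
    kind="PriorityClass",
    metadata=client.V1ObjectMeta(name="batch-low"),        # hypothetical name
    value=1000,
    preemption_policy="PreemptLowerPriority",
    description="Low-priority batch jobs",
)
client.SchedulingV1Api().create_priority_class(priority)

# A Job that opts into that priority class.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="example-batch-job"),  # hypothetical
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                priority_class_name="batch-low",
                restart_policy="Never",
                containers=[client.V1Container(
                    name="worker",
                    image="busybox",
                    command=["sh", "-c", "echo hello"],
                )],
            ),
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

Anything beyond this, such as multiple named queues or per-team fair share, has to come from an add-on scheduler or an external queueing layer, which is what the batch systems mentioned above already provide.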
A: There was a very nice talk at the last KubeCon from, I forget his name, someone from Apple.

C: Yeah, so a follow-up would be: Kubernetes allows for custom schedulers, and in fact I think Volcano is an example of that. Are people building their own custom schedulers? I haven't done it myself, but supposedly it's fairly straightforward to build. Is it something that people consider, saying we'll just build our own scheduler? Is that an option?
E: It gets considered. However, that has largely changed, I think, as of the 1.21 release, so this past year: there are significantly more hooks added to make writing or extending schedulers easier.
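For reference, the basic mechanism for plugging in a custom or extended scheduler has been there for a while: a workload simply names the scheduler it wants, and the default scheduler leaves it alone. A minimal sketch with the `kubernetes` Python client; `my-batch-scheduler` is a hypothetical name for whatever custom scheduler is deployed:

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="scheduled-by-custom"),
    spec=client.V1PodSpec(
        # Pods that set schedulerName are ignored by the default scheduler and
        # picked up by whatever scheduler registered itself under that name.
        scheduler_name="my-batch-scheduler",  # hypothetical scheduler name
        restart_policy="Never",
        containers=[client.V1Container(
            name="main",
            image="busybox",
            command=["sh", "-c", "sleep 10"],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The 1.21-era hooks mentioned here are about making it easier to write or extend the scheduler itself, as plugins layered on the default one, rather than changing how workloads select it.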
A: Okay, yeah, but it might be more than just hooking the scheduler as well. If you want to introduce queues, you actually need to handle the persistence of those queues, and if you want multiple queues and priorities, there's quite a bit of logic there that all these systems are very good at, because a lot of them have been developed for ages now. So it's not an obvious transition.
B: We looked at that briefly, and the issue for us was that we wanted to be able to schedule across multiple clusters. Then you look at KubeFed, and that didn't really work for multiple clusters, which is why we ended up writing Armada.

B: It was easy enough to do the custom scheduler part of it, but not the multi-cluster part.
A: That's a good point, actually. In the experiments we've been doing with managing things like HTCondor with Kubernetes, even if we are still submitting through Condor, we can have multiple clusters managing the Condor daemons and then have central schedulers somewhere else. So you can benefit from the Kubernetes operations simplification but still use Condor.
D: I always thought your reason for not moving away from Condor was less about the lack of features in Kubernetes and more just the inertia of being able to change user behavior, and the fact that they all know how to use Condor and there are thousands of them. Yeah, but fair share...
B: I've been joining a bunch of HPC meetups, and it's amazing how focused that group of people is on hardware. They're so into the latest hardware, how much we're going to be able to pump over this PCIe pipe, and the DRAM, and this and that. They're just fascinated with throwing more hardware at the problem, as opposed to what we're talking about, which is how to use that hardware more efficiently.
C: One last question, sorry: do you anticipate that the existing Kubernetes scheduler, the default one that comes with Kubernetes, will have options for a bunch of these going forward? Or is this always going to be: the default scheduler can only do this, and if you want something more specialized you either build your own scheduler or use this other open source scheduler and whatnot? Where do you see that going?
A: I think so, actually. This one was pretty overwhelmingly on-premises: from all the responses we got, I think only one mentioned hybrid. So I think the main question is, is this staying like this, or are people looking at hybrid deployments as well?
H: We're certainly evaluating and exploring hybrid. For us, the bigger issues were things like the United States government data protection rules around FedRAMP authorizations and things like that; being a government entity, that's the biggest barrier. But we are starting to explore that hybrid thing, though not really for HPC. It would be more for workloads that could... I don't know, we certainly don't have any clear workloads where it's like, oh, this would be perfect.
D: I would imagine a lot of this group have relatively established infrastructure and already have on-prem, so they would start there, probably all with various degrees of security concerns as well: you sort of know what you have and how to trust it. And also probably just large data sets, which is a factor that might keep you on-prem, because transferring large amounts of data around the cloud could be prohibitively expensive, and you'd also need the equivalent amount of compute to be able to make good use of it.
E: One of the reasons the University of Michigan was looking at it was that a lot of the grants were coming with cloud credits, so it would be a lot easier to give people one interface that they're familiar with and just abstract it all away. The cloud credits could go to GCP, it could be Amazon, it could be wherever, but they're still getting an interface they're familiar with and know how to work with.
D: It'll be interesting to see what a new org that would fit into this group would do. If there was a new company or institution invented tomorrow, where would they go?
B: I suppose the counter-argument, Jamie, to the big data sets we have on-prem is that for companies that use cloud data sets, like data providers who are cloud-based initially, if you're in the cloud you don't have to move the data as far as you would to your on-prem location. So we might even be in that state for some.
A: All right, I think the answer for hybrid, for us, is that we are already deploying some workloads in this hybrid mode, and the ones we do are the embarrassingly parallel type of workload. But we also have a couple where we actually established network links between our on-premises data center and some regions in different clouds.

A: It's much easier to do what was being described, which is that you depend on the Kubernetes API and you just use it for workloads that can be loosely coupled and don't have interdependencies that would require low latency or some sort of special network connectivity. The motivation is really bursting, and especially for accelerators, which we don't have many of on premises right now.
C: Mind sharing how much you go into the public cloud? When that happens, how many nodes do you spin up in the public cloud for these kinds of workloads?
A: Well, it depends on the unit. For the batch systems we can really tune the amount of resources that are there. For things like the ML workloads using things like Kubeflow, for example, we actually autoscale the clusters, so they will only scale up when workloads go there, and we try to define policies on what can go there.
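One common way to express that kind of placement policy in Kubernetes (an assumption about the mechanism, not necessarily what is used here) is to taint the cloud burst node pool and only let workloads that both tolerate the taint and select for the pool land on those nodes. A rough sketch with the `kubernetes` Python client; the label, taint and names are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()

# Assumed cluster setup, e.g. done when the burst node pool is created:
#   kubectl taint nodes -l pool=cloud-burst pool=cloud-burst:NoSchedule
# Only pods that tolerate that taint can be scheduled onto those nodes.

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="burstable-training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        node_selector={"pool": "cloud-burst"},          # hypothetical label
        tolerations=[client.V1Toleration(
            key="pool", operator="Equal",
            value="cloud-burst", effect="NoSchedule",
        )],
        containers=[client.V1Container(
            name="train",
            image="python:3.10-slim",
            command=["python", "-c", "print('training...')"],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

With a cluster autoscaler on that node pool, nodes only come up in the cloud when pods like this are pending, which matches the "only scale up when workloads go there" behaviour described above.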
A: No, it's possible because, if you choose a region within a cloud, you can set up these extensions of the network, and we do that: we can expand our on-premises data center to that specific region.
A: Okay, so then we move to the question which was: if not already, do you plan to move these workloads to Kubernetes, please expand. A couple of the answers we had were no, but for the ones with more detail, one said we have workloads in Kubernetes, that's for HPC, and some use Kubernetes to launch jobs on supercomputers; I guess that kind of makes sense.

A: Then, for some workloads, portability is the reason, and trying to burst; this is in line with what was described earlier. Otherwise it's mostly already on Kubernetes, or planning or interested one way or another. So I guess the next question is what's stopping us, and we already covered that. I don't know if anyone wants to add something.
D: I suppose this is probably what I expected to see. In a way we can't really tell, within the "both" answers, how much is one or the other. In our case anyway, we've got different groups of users: some people are a bit more power users and do access Kubernetes directly, and obviously the administrators do, but most of our researchers go through tools which we build for them to help them do what they need to do, rather than using Kubernetes directly.
C: So kubectl use is direct, and anything outside of kubectl is indirect, essentially, right? Yeah. So some kind of Python...
A: It's important also for all the role-based access control that we've discussed in the past, and the credential management side of this.
E: We had people that wanted access for troubleshooting purposes, or just to diagnose problems, and people wanted direct access to the API. We got really good at having RBAC profiles to allow that sort of thing and make sure people couldn't get out of their namespace, basically.
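A rough sketch of the kind of RBAC profile being described: a namespaced Role plus RoleBinding that gives a user read access to pods, logs and events for debugging, but only inside their own namespace. This assumes the `kubernetes` Python client; the namespace, role name and user are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()
ns = "team-alpha"  # hypothetical per-team namespace

# Role: read-only access to pods, logs and events, scoped to the namespace.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="debug-readonly", namespace=ns),
    rules=[client.V1PolicyRule(
        api_groups=[""],
        resources=["pods", "pods/log", "events"],
        verbs=["get", "list", "watch"],
    )],
)
rbac.create_namespaced_role(namespace=ns, body=role)

# RoleBinding: grant that Role to a specific user, only in this namespace.
# Note: recent client versions call the subject model RbacV1Subject;
# older ones use V1Subject.
binding = client.V1RoleBinding(
    metadata=client.V1ObjectMeta(name="debug-readonly-alice", namespace=ns),
    role_ref=client.V1RoleRef(
        api_group="rbac.authorization.k8s.io", kind="Role", name="debug-readonly",
    ),
    subjects=[client.RbacV1Subject(kind="User", name="alice")],  # hypothetical
)
rbac.create_namespaced_role_binding(namespace=ns, body=binding)
```

Because Roles and RoleBindings are namespaced objects, a user bound this way cannot see or touch anything outside their namespace, which is the containment being described.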
C: I'm also learning a little bit about some of these modern, newer projects, or at least modern and newer for me, which are Ray, Vaex, and one more... Dask, yeah. Some of those, I believe, the way they expect you to run with Kubernetes is that you have your local kube config, and the Dask scheduler will run pods inside Kubernetes, so the user doesn't really know. I mean, they know that there is Kubernetes, they have to set up some things, but the user is not the one who runs the kubectl create or whatnot. Again, I don't know how many people are using these modern services yet.

C: I keep saying modern as if it's modern; it's modern for me. But it's possible that some people, since they themselves don't use Kubernetes and there's some abstraction layer there, may not count it. Those might be the cases, yeah.
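For context on the pattern being described, where the framework rather than the user drives the Kubernetes API: with the Dask Kubernetes integration the user only creates a cluster object, and Dask launches scheduler and worker pods itself using the local kubeconfig. A minimal sketch assuming the `dask-kubernetes` (classic API; newer releases use an operator-based API instead) and `dask.distributed` packages are installed; the image and sizing are placeholders:

```python
from dask_kubernetes import KubeCluster, make_pod_spec
from dask.distributed import Client

# The user never runs kubectl; KubeCluster talks to the Kubernetes API
# (via the local kubeconfig) and creates the worker pods itself.
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",       # hypothetical image choice
    memory_limit="4G", memory_request="4G",
    cpu_limit=1, cpu_request=1,
)

cluster = KubeCluster(pod_spec)
cluster.scale(10)                       # ten worker pods

client = Client(cluster)
futures = client.map(lambda x: x ** 2, range(100))
print(sum(client.gather(futures)))

client.close()
cluster.close()
```

From the researcher's point of view this looks like any other Dask cluster; Kubernetes is just the substrate the pods land on.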
D: Let's move on, so: scale. We just asked about compute resources in terms of order of magnitude of CPU cores, from under 100 to over ten thousand. The biggest response is large, which maybe isn't too surprising, because that's the kind of thing we're all doing.

D: I don't know who's got less than 100 CPU cores; it would be interesting to know what they're up to. They've got one cluster, I suppose, and they're playing with it. Well, that was two out of the eight responses, actually.
A: So the replies are pretty much all integrating them, although in one case not, and with a thousand or more it's still quite relevant; we still have quite a bit there. One question I had, and I don't know if people want to say other things about this first, was: what types of GPUs are these, is it all NVIDIA? And is there any sort of virtualization, or is it all PCI passthrough with dedicated cards for the jobs?
G: Or shout for help in the chat if you can't use the microphone. I just want to say that the...
H: We added some GPUs; the hardware took like three months to come in, and using the GPU operator we got the nodes up and running and allocatable in the cluster in like two days. The GPU operator was awesome; I really can't say enough about it. I think it's really cool how NVIDIA is able to do that and just kind of throw it over the fence, and I don't even know how much they support it.
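For anyone wondering what "allocatable in the cluster" means in practice: once the GPU operator (or just the NVIDIA device plugin) is running, GPUs show up as the `nvidia.com/gpu` extended resource, and a workload only has to request it. A minimal sketch with the `kubernetes` Python client; the image tag and pod name are illustrative:

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="cuda",
            image="nvidia/cuda:11.4.2-base-ubuntu20.04",  # hypothetical tag
            command=["nvidia-smi"],
            resources=client.V1ResourceRequirements(
                # Whole GPUs only, unless MIG or vGPU-style sharing is set up.
                limits={"nvidia.com/gpu": "1"},
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```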
H: We just got GPUs in to start doing some of that stuff with, but we haven't played with those yet; they're sitting on the floor, getting installed hopefully in the next week. The ones we have today were Voltas, I believe. So then, you know, we get a Jupyter notebook that allocates a full Volta, and they use it maybe less than 10% of the time.
A: Yeah, we also offer the possibility to do the virtual GPU that NVIDIA already supported with T4s and V100s, but it was kind of time sharing. We realized that, in addition to being very unstable in terms of performance, there were limitations in doing things like that: some bits of functionality were not available with that sort of driver. It also needs an additional license, but that we managed.
D: You mean other vendors than NVIDIA for GPU? That was the question, I think.

D: I don't think we've got on to that yet.

D: Yeah, no, I think we're just NVIDIA at the moment.
A: It's easier for now, but yeah, we would like to get something in addition. There are sites (we collaborate with a bunch of sites around the world) that have AMD cards as well, so we started looking at integrating them, but for now it's all NVIDIA. I think they've got pretty much the market there, but we also have issues with the delivery times.
A: All right, I'll move to the next one, because we're actually running short on time as well. The next one is other types of accelerators; I put FPGAs here, but actually another reason we burst into the cloud is to use things like TPUs as well. So I don't...
B: We're looking at all the Graphcores and SambaNovas and what are some of the other ones, Ascend or something like that? There's a bunch of those things being tested and played around with, but nothing that's gone near to production or Kubernetes status.
E: Honestly, I experimented with it back when I was at the university, but outside of mounting the device into the container, not really; it never got beyond me essentially messing with it.

E: I haven't really looked at it since then.
A: Okay, maybe we take it as an action, for me as well, to investigate a bit where we are with this.
A: For those that replied: is this just seen as an extra PCI device that is given to the job, or how does that work?
E: Well, as far as I know, there is a way of mounting the device directly in there. Intel actually has an operator that does it too, if I recall.

E: It's all through device plugins.
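To make the device plugin point concrete: every plugin, whether for GPUs or FPGAs, advertises its hardware as an extended resource in `node.status.allocatable`, which you can inspect before requesting it from a container. A small sketch with the `kubernetes` Python client:

```python
from kubernetes import client, config

config.load_kube_config()

# Device plugins advertise their hardware in node.status.allocatable as
# extended resources, e.g. "nvidia.com/gpu" or a vendor-specific FPGA name.
for node in client.CoreV1Api().list_node().items:
    extended = {
        name: qty
        for name, qty in (node.status.allocatable or {}).items()
        if "/" in name  # extended resources are namespaced with a slash
    }
    if extended:
        print(node.metadata.name, extended)

# A workload then simply requests that resource name in its container
# resources.limits, exactly like the GPU example earlier.
```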
A: I don't know if anyone wants to add anything to what is already here. I think we see X.509 and Kerberos for auth, and the main thing would be how these credentials are being maintained and refreshed for long-lived jobs and things like that. I guess everyone has this sorted out, or are there any problems there?
A: There you go, okay. So maybe we jump to storage then. Jamie, do you want to take this one?
D: Yeah, sure. So the question was around how we handle data in our clusters, what kind of file systems people use. It's quite split: lots of Ceph and CephFS, with people choosing multiple options as well, but also Lustre, GPFS, HDFS. Lots of different responses. I don't know, is anyone interested to know if the HDFS people are on the call? We haven't actually talked much about that previously in our group.
B: Yeah, I'd be interested, if any HDFS users are on the call, whether they are looking at Ozone; Apache Ozone is a replacement for HDFS.
A: All right. I just saw in the chat, Nathan (I don't know if you can turn on the microphone), a couple of comments from you that would be quite interesting, which is how many of these sites are using containers in Slurm.
H: Sort of, I don't know. I mean, did you see that Apptainer is the new Singularity? They just announced that the other day. I feel like, with HPC containers, people want more than what they think they want, kind of thing. We were working for a while on trying to replicate Singularity-style HPC containers with Podman, and really, with the amount of holes that you poke in the container, it turns into more of a sieve than a container, because you really want to bind mount in all of your BLAS libraries, the GPU libraries; you want to pull all that stuff in off the host. It kind of necessarily breaks that isolation.
H: I still think there's really good stuff about it, and even NERSC showed, with what is it, it's not Singularity, they've got another one based on Docker, that Python applications actually perform faster across a cluster in a container than outside of a container. It has to do with how Python looks up paths for linking dynamic libraries and such; it doesn't have as many paths to look up in a container, compared to a normal HPC host. So it's kind of funny. But I don't know; we get tons of requests for people to support HPC containers, and people do use them, but I feel like we always have to have this hard conversation of, like, okay...
I: Well, there's been a lot of research, at a good number of sites, on how to get the performance out of it. The common trick now is to bind mount the MPI layer in, especially on the Crays, and then of course that breaks a whole bunch of other stuff.

I: None of these limits are new, actually. Let me go look up the paper or the presentation I have on it; these issues have been around a long time. Yep.
I: There was a group... when you want to go fast, you will lose compatibility, necessarily.
H: There's a good quote from another guy at a different lab who said that HPC containers are teaching a whole new generation about library linking errors. And it's so true, because you're right, that's what you're doing: you're mounting it all off the host if you want to get the performance.
I: I put the link in there. We did this back in 2017, and these aren't solvable by containers or anything else. A whole world of issues comes in when you want to swap architectures, or compile against SSE4 versus SSE3 or whatever.

I: There are a lot of problems with that, especially with the move to single-precision floats on the GPUs. With the fancier NVIDIA ones with double-precision floats it's not as much of an issue, but it still matters, and then the lack of the IEEE float standard being consistently and completely implemented makes it entertaining. I posted the link to a lot of the limits that have been around for a while.
I: I mean, right now there's a lot of glue work that goes in for things like getting Jupyter notebooks to work on HPC, or running them on Kubernetes and then bursting out to HPC, and stuff like that. It would be really nice to know what the sites really need and what they're doing. I understand the use case: you want to use Kubeflow, you use Argo or something like that, or hell, you don't care how it runs, you just want it to run.

I: But yeah, you hit a lot of the complications. In a lot of cases you're going to have to recompile absolutely everything to get the full performance, when you're jumping from your laptop, which may be an Arm Chromebook, to a Xeon box or something like that, or even a POWER8 or POWER9, or POWER10 now. We should probably...
A: Yep, let's browse through the rest, and then we can come back, I think.

A: We started late as well, so thanks a lot, yeah. Should we browse real quickly then? Monitoring: we see Prometheus, okay, Fluentd.
A: All right, so this comes back a bit to what Nathan was just referring to, which is how our container images are built. I think one of the replies answers him: in most cases we don't have people building locally, they just push somewhere and there's some sort of CI/CD that will build for multiple architectures. So the systems here: we get GitLab, Jenkins, Tekton, and then manually.
A: But then GitLab, Tekton, manual again, very likely manual, okay, there's quite a lot of manual, and Jenkins.

A: They push to a branch, and then the runner will get the webhook, clone the code and build locally on whatever hardware the runner is running on, and we basically replicate that on all the architectures.
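As an illustration only (the setup described above runs a native build on a runner per architecture), here is roughly what a single multi-architecture build step can look like when done from one runner instead. It assumes Docker with buildx and QEMU emulation are available on that runner; the image name and platform list are hypothetical:

```python
import subprocess

# Hypothetical image name and target platforms for a multi-arch build.
IMAGE = "registry.example.org/group/app"
PLATFORMS = ["linux/amd64", "linux/arm64", "linux/ppc64le"]
TAG = "latest"


def build_and_push() -> None:
    """Build the image for every platform and push a multi-arch manifest."""
    subprocess.run(
        [
            "docker", "buildx", "build",
            "--platform", ",".join(PLATFORMS),
            "--tag", f"{IMAGE}:{TAG}",
            "--push",  # pushes the per-arch images plus the manifest list
            ".",
        ],
        check=True,
    )


if __name__ == "__main__":
    build_and_push()
```

Running native builds on per-architecture runners, as described in the meeting, avoids the emulation overhead at the cost of maintaining more runners.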
A: All right, then we go to registries, so here are the answers.

D: One issue to raise: we use Artifactory and we've run into some scaling problems with it, but we've recently started looking at Dragonfly, which is a sort of caching layer. It's very early days, but it looks pretty good, actually. We originally started looking at something called Kraken, which I think was out of Uber, but it seems to have died in a ditch, so then we moved sideways onto Dragonfly, and it looks pretty good; that's taking some of the pain away from Artifactory.
A: Anyone else? All right, let's go through; I think we only have two more. So, languages: it's pretty much half Python, and then the other half is split, some Fortran. So that's pretty good.

A: Yeah, that all sounds pretty reasonable. I think that's it. I don't know, do we want to highlight anything in particular? We're already three minutes over.
A: So maybe we take those as topics for our next session. Otherwise, thank you very much everyone, and we meet in two weeks for Jamie's and Jen's session.