Cloud Native Computing Foundation Research End User Group, 15 Jun 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: CNCF Research End-User Group Meeting: Bare Metal Deployments

Description

Join Jamie Poole and Scott Solkhon from G-Research as they discuss their approach to bare metal deployments.

The User Research user group meets regularly to discuss and advance Research Computing using cloud native technologies.

If you'd like to learn more about the CNCF End User community or join as a member, you can find more information at: https://www.cncf.io/enduser

A

All right so welcome everyone to another session of our research user group, so today's topic has been uh requested for a long time and we never actually got to cover it so elements, and we have jamie and scott from g research to present how they are handling this and and teach us how to how we should do it so yeah onto youtube.

A

B

All right thanks right, let me see if I can work out to share my screen.

B

Don't think I have shared on this platform before wow there we go all right. Can you see that before we kick off.

C

Yeah, it looks great great.

B

Okay, all right so um yeah. This is uh me and scott are going to talk to you a little bit about uh bare metal kubernetes at g research, so not necessarily telling you exactly how to do it or how the only way it can be done, but just talking a bit about our adventures with it.

B

What we're up to uh what we've learned along the way um so in terms of introductions uh we've, actually both of us been to cern to visit um ricardo and co and managed to get the obligatory photo in front of, um I think, that's alice, um so we put those next to each other, but yeah, I'm jamie paul. um Most of you probably know me because I co-host this with ricardo every other week anyway, although haven't been around for a few weeks.

B

So apologies for that and I'm the compute platform engineering manager here at g research, so responsible for all things: kubernetes and batch compute and calc farm, and that kind of thing, and then I've got scott here with me who I'll play introducing hi uh yeah, I'm scott. So.

D

I'm cloud engineer: I work mostly on openstack uh but yeah, uh so yeah. A lot of that is ironic, cool.

B

um Very very brief bit about georesearch for those who don't know so we're a fintech company based in london. We run a large distributed research platform for teams of quants to look for patterns in real world, noisy data sets of financial data uh looking for for patterns for for our clients um and currently we're we're still saying for a while. But it's still true migrating large amounts of our batch compute workloads from windows and hd, condor onto kubernetes and linux and containerization, and all that good stuff.

B

uh So yeah without further ado, go straight into the ironic portion of the presentation which I'll hand over to scott for.

D

Okay, so uh so yeah a little bit of a background. First of all. So what is ironic ironic is an integrated openstack service which aims to provision bare metal machines instead of virtual machines, uh so ironic supports using vendor-specific plugins which implement additional functionality, such as moving machines between different networks, um so yeah and the main things about well. The main thing for this talk really is to focus on the different states. We have in ironic it's not limited to these, but the main ones are rolling cleaning holding and provisioning.

D

So how does it work under the hood? um So ironic, it's pretty straightforward. So what it is? It does ipmi and pxe and a mix of mix that and around disk image, um and then it turns machines on and off and moves in between different networks as they move through different parts of the build, so ironic can be, can be deployed standalone, but most common way to do it and probably like in a production environment. um It sort of sits beside um other openstack projects such as, like nova, neutral and glance.

D

Just a bit of background, so nova is used for deploying like vms, like virtual machines. Neutron is your network and and then glance is like an image catalog. um So yeah uh ironic will use those uh different services um to to get images or change networks or whatever it needs to do so.

D

The good thing as well is when a bare metal machine is deleted uh by the user, it's cleaned and then um it's just returned back into the available pool, and then someone can else just someone else can just pick out that that pool so this is really high level um diagram, just to sort of show um the sort of enrollment uh stage that we've got.

D

So if you look on the left there there's a few open source products we use, um so one is kyobi, which is a sort of a subproject of the color ansible project in openstack, and that's used to deploy also use it to deploy new bare metal nodes into ironic as well. So essentially, it's just a bunch of ansible and we use um muse jenkins to sort of uh to orchestrate that I guess so. If we look at what the enrollment phase actually does, so we go through pre-inspection.

D

First of all, it's just like the pre-steps before you can actually look at the nodes um and work out what what's going on in there. So we create a record of it in the openstack api. Then we set the resource class we buy. We apply some baseline, bios and ilo settings. The important bit here is the resource class. So a resource class essentially is just like um defines like what I know like a type of node, so you might have like a certain type of gpu, node or cpu node or like specialist hardware.

D

um You need to find that as a resource class. It basically says this is what my server should look like. You just have this much ram these disks and all that kind of all that stuff.

D

So um we define all that- and we say this is what I expect these new nodes to look like, and then we go through to the next phase, which is inspection, which is an ironic sort of state. So it will turn the server on pixie boot into the ram disk, and then it will discover what hardware is there check for things like cabling issues and identify the switch it's plugged into? So when we move it, we know which switch to log into to actually move the port, and then it will create those ports in ironic.

D

um So what that allows us to do is then basically cross check between the resource class and what the um server actually actually has inside it, because if you've got, if you think, if you've got a big pool of servers and when you hand one back and you take a new one, you want to make sure that you're getting the same server back well, not literally the same, but one that has the same spec.

D

So once we've done the inspection we can. We then know where it's plugged in and which switchboard so neutral will then move it to what's known as the cleaning vlan, which is essentially the same as that the cleaning state. And then we go for like a what's known as cleaning. So cleaning runs inside the ram disk image again and what it does is uh it will just boot into it, and it will have like a set of steps, basically python scripts, and it will just run through those in order of priority.

D

So we do things like updating firmware, verifying that the ilo settings are all correct. Ntp um set up the storage, we wipe the hard disks and then we check if there's gpus in there we'll check their health as well. Once it's done that it should be good to go almost so we just finally run some tests on it, so we run some burning tests and then we move it to the holding vlan, and then it goes from cleaning into the holding state.

D

That essentially means it's ready for a user to pick it up the other side. So if we just look back at that diagram, you shall understand this a little bit more now um so nodes come in from the left for our automation into ironic api. We inspect them and then they get moved into where they are in the data center. So a conductor basically is a microservice in ironic, which is um which its purpose is to sort of look after a group of nodes.

D

So you might have like a common set of nodes or like an area in a data center like an availability zone or something like that, um so they all just get bunched up and yeah and then it's all ready for ready for people to use the other side.

D

So moving on onto the deployment side, um so there's a person there. They will pick um a flavor and a network and an a-z in an image um and then that transforms into some novus uh some stuff that happens in iran in openstack and then out pops a bare metal node, the other side, so we'll just take a little bit of a closer look at that happens.

D

Is the user requests the new bare metal machine via terraform in our case, but I mean you can just do it by the api if you really want to um and then the flavor selected is the thing that maps to the resource class. So earlier, when I said, you've got a resource class, it's like a type of server.

D

The flavor is basically the user defining what type of server that they actually want to pull from the pool, and then the network and the availability zone that they select maps to some sort of location within a data center that allows you to sort of scale this to quite a lot um quite a lot of servers.

D

So yeah. If we look back here, that's just that first bit just up here, so the user selects and they go into the openstack process.

D

So the openstack side of the process um you hit that you hit the nova api, then that will talk to placement and the scheduler, and that will basically look in the pool and it will say, what's available, give me the first node of the top or the first hundred nodes or first thousand nodes whatever. When you select neutron, although then go and move all of those into the provision in vlan, so then they go into that provisioning state.

D

So we go from holding back into provisioning and then this is the state where we get ready for the user to use. So in machine provisioning. We turn the server on um using ipmi. We pixie boot into the round disk image, and then we apply a few bios settings that might be like um hyper threading on or off. That's probably the most common one. But you can you can configure anything you want um as long as it's available via the api um and then we pull the user's image from glance.

D

So the user will specify that image when they actually build the machine. They don't want to run the ram disk image, because that's what all our tools in it hasn't got their tools in it. So this would um in our case, be flat. Car would get pulled from glance, which is the image service in openstack, so that basically explains the blocks there on the right, so yeah request comes in schedules.

D

Nova compute will coordinate um some stuff in neutron to move it to the right uh vlan and then the ironic conductor will pull the image down, put it on the node then, and then from there. All we need to do is move the vlan again into the the requested um vlan that the user wanted, and then we just restart the server and then the server will just boot into it into an os and then present a prompt screen that the user can log into that means.

D

Yeah then, hopefully, we've got everyone got their metal servers and they're happy to go and use their um their fleet of servers. Now that is all good until the users, then is finished with the server. So the idea between ironic is sort of cattle, not pets, so you use a server for a lot of time or, however long you need it and then you hand it back, go through cleaning and then it goes available ready for someone else to use. So it's really really flexible.

D

So yeah the user deletes the server neutron uh goes and moves the server to cleaning. We go through those same cleaning steps. So if the firmware has changed since um since users handed back the machine, then that will get updated, wipes all the disks. So it's all nice and secure when the new user use uh gets given the server um and yeah. We checked that it hasn't been tampered with or anything like that, as well.

D

Just for extra security checks uh and then yeah neutron will finally move it into holding and then then it becomes available again so just to recap on those states. So, first of all you roll the node into ironic. Then it sits there and it's ready for the user to use um and then we've got cleaning between holding and provisioning really so yeah enrolling cleaning holding provisioning. That's about it really ot.

B

Thank you all right and then more on to the sort of how we use this for kubernetes and research purposes, side of things, I suppose so we uh historically we've always run kubernetes in g research on openstack um for a long time, we've been doing it on on vms, but more recently we started moving on to building clusters using bare metal, uh but this process is pretty much the same regardless.

B

We just use a different flavor as scott was talking about so the way we tend to do it is we define our clusters in um terraform terraform code in github.

B

We then use terraform enterprise in our case, to build the clusters into openstack using ironic machines are built and configured using the flat car operating system and then flat car uses ignition to pull down user data and configure a very minimal kubernetes installation. So our initial bootstrap task basically gets us a small bare metal server. uh Sorry collection of servers running a kubernetes cluster, uh pretty vanilla.

B

um Currently, we still do um the lcd nodes as virtual machines, but for our larger and higher performance clusters we now use ironic and bare metal for the master nodes and all the worker nodes as well.

B

Once we have a minimal cluster, then we apply our more detailed kubernetes configuration on top. Typically, we do that these days using jenkins or arc cd or combination of the two actually sort of undergoing a bit of migration at the moment, and that's just how we then deploy all of our sort of desired state kubernetes configuration on top so things along the lines of ingress controllers and calico and all the other bits and pieces and things that we want to have in our clusters to make them look and feel like our desired gr clusters.

B

uh Once we've done that we then deploy armada. So this is an application which I've talked about a bunch of times at this uh in this forum. So I won't go into too much detail now, but this is just the overall architecture diagram of the the application which we typically deploy on top of these clusters. So you can see here, the the blue boxes at the bottom of the screen are kubernetes clusters in this sort of new world of metal. These are all but high performance bare metal clusters. uh Quite a large number of nodes.

B

We tend to scale up to about a thousand uh and then one our model server sitting on top, which allows our users users to submit jobs, to uh run on the hardware, um a couple of notes just on some benefits, we've seen so far, so that this is really the reasons for us moving to this model in the first place.

B

So it's still early days, but the things sorts of things we've seen are increased stability, so certainly for things like gpu intensive workloads, we had seen some issues when we were running on virtualization that have just completely evaporated since movement spare metal. uh It's certainly been a lot simpler than trying to debug sort of uh kernel level issues within the within the virtualization layer, just to move to their metal and not really worry about it.

B

um Some other benefits we've seen are things like increased network throughput between nodes and external resources, uh being able to use bgp peering very easily uh can be done with vms, but it's a little bit more complex um for us as well. Typically, we end up with much larger nodes, because your bare metal servers tend to be a bit bigger than your average virtual machine. By definition, I suppose- um and for us it's just simpler estate management as well, so we have fewer layers between our our workloads and our hardware.

B

um Fewer machines, fewer bigger machines, tend to be slightly easier to manage than tens of thousands of smaller virtual machines.

B

uh However, there are limitations as well, so some of the things we've noticed so far, certainly um a slower provisioning time, which is actually completely expected, as you can imagine, when you're provisioning, a bare metal server, you're, actually, basically turning on a real machine, and you have to wait for it to power on um with a virtual machine. All of that sort of abstracted away from you, you don't really see it. um There's a lot more precise quota management required.

B

I think you can be a little bit more fast and loose when you're running a large virtual estate. You can oversubscribe things and uh you know over subscribe, cpus and things like that. It's much harder to do in a bare metal environment, you're very much constrained by the physical resource you actually have. um It is a little bit less flexible in some ways and there's some features of virtualization, which we don't get as a side effect. So things like being at a snapshot of vm are quite useful.

B

You can't do that natively using a bare metal server, so you have to sort of roll something or use some other tool to do that, and we have also noticed in a kubernetes world. It can be a little bit tricky, sometimes to mix and match in cluster having virtual machines and physical machines.

B

So we've tended to take the approach of just starting from scratch and building new clusters as bare metal from the beginning, rather than trying to add it into existing virtual clusters, but yeah in summary, um for us we're now using bare metal kubernetes for our highest performance workloads, um we're still also making heavy use of virtualization, where appropriate, so from all sorts of classic kubernetes clusters, if you like, for services and so forth, we're still making good use of vms, but for the clusters, where we really care about performance and we're running lots and lots of high throughput jobs, then we are now moving to my metal and openstack ironic is our metal as a service choice?

B

I think, is it little whistles doctor, but are there any questions.

A

Awesome, thank you jamie and scott. It was uh nice nice summary um anyone has any questions, feel free to just go for it and ask. I will have a couple, but uh I'll leave the floor for two others. First,.

E

I'm here I have a question: um did you look at any other tool, suites uh besides um um ironic or was were you set on ironic because uh I believe it's an open stack project right. It.

B

Is yeah? Okay, we have looked at some other things, um we're relatively opinionated about it. I suppose, because we're already got quite a foothold in ironic, uh sorry in openstack, using lots of other openstack um uh services, as scott mentioned, uh we have actually on, I think, independently ahead of ironic rolled our own metal as a service system internally, which does work as well, but it's kind of nice to be able to use the off-the-shelf open source tooling. That fits in nicely with the rest of our ecosystem.

B

Okay, but I know there's other things as well like mass and others that we we haven't evaluated at in depth, but yeah ironic seems to work well for us.

E

A

All right, okay, so alex did you have a question as well? So you want me to add some point.

F

I was wondering whether they had um looked at the you know. You mentioned the provisioning times the slow, um whether there was any uh looking at pre-provisioning sort of expected images that you're gonna spin up. um I know that when we had the uh on metal service at rackspace, before we switched over to ironic, that was part of the whole plot was to pre-spin up these uh bare metal servers.

F

It was supposed to come back in ironic, but- and that was years ago, but I don't know whether that's actually come back so.

B

It's not something we've used yet I mean certainly some things. We've looked at in our processes where we can save time with earth. Maybe you can cover this, but some of the things like bios settings and things where we want to make sure we eliminate the requirement for reboots and things like that, I suppose during the process, yeah.

D

So um yeah there's a little bit of that we can trim around but um yeah. So the way it's designed is you have one big pool of data, sorry of nodes, and you can sort of have multiple tenants using that where we don't have that, um there's some stuff that we can sort of pull out like, for example, the buyer settings um they're they're applied at like a provision time if they're static, then you just don't have to reapply them every time.

D

You just have to make sure in cleaning and I was tampered with them and then you can save a bit of time. There also there's lots of things you can do about caching, images and that kind of thing, and actually for us, that's something that has been relatively easy, because it's um because these these kubernetes uh clusters tend to use the same image and then we have lots of them that use the same image. So everything gets cached and it's all kind of hot at all times, pretty much so um there's more things.

D

If that becomes slow in the future, we can move glance closer to the actual bare metal nodes. But at the moment we're finding that um the cache is pretty warm and um yeah. Well yeah. It's performing.

B

Pretty well one thing which I noticed, which surprised me actually in my own reaction to it I mean is uh the first time I saw it. It took 20 minutes to to build a server. I was like. Oh, this is a nightmare. This is going to make everything really slow and difficult, um because I'm used to a vm spinning up in 30 seconds or something, but actually when you're used to it and you're doing things at large scale and in bulk.

B

It doesn't really matter if one server takes 20 minutes, if you can build hundreds or thousands simultaneously, um you actually end up caring a lot more about reliability and being comfortable that your automation will just work and you can walk away from it and come back later and everything will be up and running. It will be much worse if it was faster but less reliable. So I always sort of er on the side of reliability over performance personally of the build that is once it's up and running. We want performance as well. Obviously,.

F

I mean for us if we have long-running jobs that take days. 20 minutes is neither here nor there.

B

Yeah I mean 20 minutes is slightly uh anecdotal. I would say: that's that's our current experience for a certain type of flavor, but it's uh it's of the order of minutes no longer seconds, but in that way.

C

So is is that the mode of operation that everybody want, so if uh somebody submits to to run a particular workload, they get provisioned a particular resource or resource type. It's not that some things are long lived and people kind of swap or you know, interchangeably, use the same standing resource it.

B

Depends how you choose to use it in our model? What we do is we have a bunch of hardware and built into clusters ahead of time which sit there and are used relatively constantly by a collection of different users, so it affects the hardware as well being a large pool of hardware is all being shared by lots of different people, um we're quite lucky in the sense that we've got relatively in the grand scheme of things, a relatively small pool of researchers all doing quite a similar thing.

B

So we can be quite uh prescriptive about the hardware that they'll get so we have a smallish number of flavors of cpu nodes and similar gpu nodes and potentially, in future other accelerators.

B

um It might be the case that in other companies who do or other organizations, even who want to do a more like, I guess- offer metal as a service or cluster as a service up to users to actually create their own, that that would be a possibility. But for us we we take the more sort of we provision it. We being the infrastructure and platform teams and then our our users within our organization, then just use what we've provisioned for them.

B

But this would lend people provisioning their own if they wanted to.

C

How does it work that they they submit a ticket, and you you take care of that.

B

uh It depends what you mean so, generally speaking, the way we we have is we have a these pools of compute, which we understand uh the sort of flavors and qualities of, and then we have a bunch of tools and software which allow users then run jobs on them. So they don't it's not a ticketing system. It's really a case of they can just. They are already set up with access to this large pool of compute, and then they can submit jobs.

B

To then use the hardware as they see fit so effectively run run jobs as pods in kubernetes, ultimately, is what happens on top of the hardware that happens to be provisioned through ironic, okay,.

G

So, to get a sense of their time scale, how long do the clusters live and or if it's like a dynamic cluster? How long does the uh the nodes typically live? Like you know, yeah six months on a cluster and every few weeks things shrink and grow or which was.

B

Yeah, so it looks loosely like this, so we actually have multiple of this whole picture, in fact, but if we just look at one of these as an example imagine this is a data center. We have many of these clusters under here each one of these clusters itself, but the cluster, I suppose in it in in of itself may even last for years we might create it. You know a couple of years ago saying it's still running now and we'll still be running jobs on it, the nodes themselves.

B

We tend to quite frequently rebuild, um because I think we actually have a bit of a fetish for sort of rebuilding stuff in gr, making sure everything comes back clean and tidy.

B

So we actually have a separate project at the moment going on to ensure we're constantly rebuilding things and making sure there's a maximum lifetime of the actual nodes, but clusters themselves can last for quite a long time.

B

um We probably also eventually, I think, we'll move into a more rolling cluster rebuild process as well, because, obviously that's long-lived state itself which could get dirty or out of sync, somehow it shouldn't but as possible, but no generally speaking, the clusters themselves live for quite a long time and then the nodes within them are of the order of tens of days maximum couple of months.

G

So this architecture is really for um at that facilitator level, where you're you're building environments for individuals and you're you're, keeping it you're moving with whatever the ongoing research is and the reason.

B

Yeah short time scales on the pods and whatever yes yeah exactly so, we've got like time scales. Then time scales, the pods themselves are anything from uh seconds up to a couple of weeks, say and then the nodes and the clusters last for a lot longer and they were they're just sort of running this primal deal soup of of user workload, um but also just using ironic or any kind of metal as a service is also just a useful thing.

B

If people have sort of high performance requirements or just have a different estate management process, I suppose so I know ricardo and the guys at cern.

B

Don't do this model, so we have a model where we create clusters and then effectively offer you can think of it as like, namespace as a service, so that tenancy is the thing which we offer people on the existing clusters, whereas I think, or certainly when last time we were talking about it over in cern they're, doing more sort of cluster as a service, so people can ask for their own clusters, which then may use something like ironic. In fact, do you do that? Ricardo?

B

Do you have ironic as an option yeah under classes as well? Yes, you do exactly that.

A

Yeah, you can even have like mixed clusters with node groups or node pools in vms, and additional node pools using bare metal makes sense.

A

I had a question because you mentioned it's a kind of follow up for the last one, which is uh you you describe the workflow with github and then the provisioning using terraform. um Do you also use this for like cluster upgrades or or is this like you just redeploy from scratch and you cluster.

B

That's a good question, so we tend to use uh the cluster. Bootstrap thing is kind of a one-time thing to build a cluster. uh If we have quite a long loop cluster, then we can actually do all of our upgrades. Then, from this point onwards, if this makes sense so things like upgrading kubernetes itself, um we have a bunch of tooling to do that. So we can do it in place. Cluster upgrades, um even the kubelet on all the nodes as well, because that itself is containerized um and similarly then uh operating system upgrades.

B

We can do in a rolling fashion because we have this model where here, underneath the long lift cluster the nodes, get rebuilt, sort of sequentially, underneath the cluster uh with error, budgets and so forth, so that we don't do the whole thing at once, um but we have options we can also. If we want just completely blow, you know, coordinate, wait for stuff to drain and then blow it away and rebuild it all. If we want to do upgrades, but yeah we tend to just.

B

I guess the separation we have is terraform tends to be used. For the note, the cluster slash node build process and everything afterwards is through all right jenkins and okay.

A

Okay and uh the other, the other question I had was uh like uh there's quite a lot of activity in this, try to to kind of uh manage the clusters from as if they were kubernetes resources and then just build on things like argo, to kind of make everything uniform.

A

Yes, is this something that you've looked into and is because I was just searching now for the integration of like metal as service components into a cluster api? Is this something that would simplify or that you would not consider.

B

I would definitely consider it and I'm very excited about it, and I would like to do it at some point, but it's just never quite been up the priority list enough for us in our world. I think what it would end up doing is effectively replacing terraform yeah. I think basically we would be going straight from github.

B

Well, I suppose we'd have something to bootstrap our initial cluster somehow and then cluster api would then go off and talk straight to openstack, but hopefully everything we've already done would then continue to integrate nicely, and we would just use that directly. So I think it's really a question of how well supported openstack is by the cluster api right. I haven't checked recently, but yeah. It would be very interesting to do that cool, certainly a limitation.

B

We found not bare metal specifically, but as soon as you get to large scale, kubernetes or any configurations, in fact, within terraform, it is a bit slow.

B

It has effectively maintains a big graph of resources which it has to have to walk, walk the graph. Every time you make any kind of change, and especially when every resource is actually a remote thing that has to go off and be checked, then you can imagine that ends up translating into a lot of api calls which can be quite slow and expensive.

B

So if we could turn that into something a bit more elegant using kubernetes itself, I'm all for it.

G

A

Checking here, if there's.

H

Other questions in the chat of someone ah there you go um when you, uh I have another question um so.

E

uh Do you does your team manage the networking equipment um and do you have like kind of broad control over that or do you work with the networking team.

B

uh We have a networking team who's more responsible for that. So within our organization we have a few different functions and different areas responsible for different things, so we've got an infrastructure function and a platform function me and scott from both of those respectively.

B

There is a team within the infrastructure function who deals with networking specifically, um but what we're definitely finding is having more cross-functional teams is really powerful. So I've got people in my team who have got really strong networking skills and understand that kind of stuff, including down to the hardware, um and I actually suspect, over time, we're probably going to need to develop some kind of special cross-functional team that just looks at performance and tuning of the estate. Basically because we need to be able to do it all the way from top to bottom.

B

Really, especially when we're now dealing with metal. You know we actually need to understand how everything is configured all the way down to the bios.

E

Yeah um we, you know we use a lot of, but we have a lot of bare metal and uh vms um and we just there's this friction with the same networking team professional disagreements, maybe over how the switches should be uh managed- uh and I saw like in during your presentation- you were switching vlans and uh at the beginning and it seemed like you had a decent amount of control.

E

B

Yeah, we, I think, I think we do. um Our networking team is quite sort of up to speed with everything that we're doing as well, and uh I mean like you say, though there was always friction, sometimes between teams, because different teams sometimes operate at different rates and when we've got responsibility shared across groups, it can be tricky, but um we've got quite a sort of singular purpose at the moment. So there's a particular large project happening at the moment, which involves a lot of this stuff.

B

So we've got a lot of people from different teams all working together to make it happen. So it's quite nice.

E

And these clusters are fairly large, there are dozens or hundreds of servers. Is that what you were saying, yeah.

B

Yeah I mean up to about a thousand nodes in the given cluster. um We could go, we've actually decided arbitrarily to sort of stop about there, but that was in fact one of the reasons for the armada architecture so that we could have many of these things, um because we're aware that past a certain limit, kubernetes can't really scale much further.

B

I think the official limit is still 5000 nodes, but I know from I think anyone knows from going to conferences and things that you have to do quite a lot to get that far and then over backwards. So we have a model where we just go to about a thousand and then just plug in more clusters horizontally and scales. Quite well. That way and is armada, um a g, um a g research project, or is that um yes, okay, so it's an open source. But yes, it's come from g research.

B

I'll tell you there's a probably further back in the uh list of um meetings. There'll be a recording of some stuff. We've done specifically on this. If you're interested, okay.

A

I have one more um you: may you mentioned the issues with gpus and stability and improvements by moving to bare metal yeah? Were you doing a pci pass-through, I guess and in vms, and do you remember which, which specific issues you had and how did they think we.

B

Were um yes? Yes, um I can't remember the specific issues, but we were basically getting unexpected errors. Things were being reported, as um you know, not a number and that kind of thing just mathematical errors which shouldn't have been happening um under quite nice circumstances as well. When we were running out of memory- and you know, you'd have to have a few different failures happening a certain way, but somehow we managed to always hit this scenario quite frequently, and then we just thought well, hey, look rather than try and debug all this.

B

Let's just see what happens if we run on bare metal and lo and behold, the problem went away. So sometimes it's just not worth sort of.

A

Yeah, I ask is because we we have been seeing simulations with virtual machines recently right and uh yeah, that that is a tempting solution. I.

B

Guess I mean it makes sense, doesn't it if you think you're just going through this whole extra layer that maybe you don't really need to and there's a lot more software involved? Isn't there ultimately.

A

Yeah, the the issue is: how many gpus do you have per node on average for us uh up to about eight all right? It's the issue is that for kubernetes clusters this is easy to handle, but for, if you have a mix of vms and kubernetes clusters, using those gpus actually virtualization allows you to like expose um is on a multi-gpu node quite easily yeah.

A

um Well, if you just dedicate like parameter nodes to to to people directly as you would do with vms, then you basically like potentially giving them a a really nice way to waste uh pressure, reserve resources. Yeah, that's true yeah! I think I think that's that's. The reason why, like for coordinates, cost is kind of a no-brainer that you can go bare metal for gpus and just schedule directly.

A

All right, that's pretty pretty um cool. Have I have one one question? I don't see anyone raising like. If I, I guess the question that a lot of people will have is if I have a bunch of nodes arriving on a new data center or whatever on premises.

A

What would be the suggestion if I just want to do kubernetes um on bare metal like what's the the best option or and the least uh complicated option to get stuff up and running? You know sort of.

B

Yeah, um quick fashion, I mean I I don't know because I know what we do and we've got obviously quite opinionated about using openstack and ironic. I'm sure I I think one thing that probably can be said of ironic at openstack is: it can be quite complex and it's probably quite difficult to get up and running.

D

B

D

There's two projects that you, probably if you want to have a guide to look at um which sort of lower that barrier to entry, so one is called bifrost which will allow you to just um sort of run um like ironic from like just laptop, that's good for like bootstrapping new environments, where you don't already have a control plane um and then the other is what we actually use here to deploy all of our openstack, which is color ansible.

D

um That is it's basically, a collection of uh ansible uh roles um and basically a lot of the hard work's been done for you. um So a lot of it just kind of works out the box and you can deploy it. Vanilla, open, stack really easily to like a couple of vms on your machine or if you've got a couple of um bare metal nodes, you can deploy a control plane there with relatively little openstack experience um tuning it and getting it um to large scales.

D

That takes a lot of time and experience and working through issues and that kind of thing. But um yeah. If you want to get started, the barrier to entry is not really that high on yeah, either by frost or color answer really.

B

I'll be really interested if we could do some kind of questionnaire for a wider group. Obviously our group as wide as we can get it to find out when people are using metal as a service product. What they're using because even knowing what's out there is a challenge, sometimes.

A

Yeah, I think that comes back to this idea that we've had for a while, which is to do these recipes for different sorts of uh workloads that are kind of specific for research environments. I think, like the deployment on premises and norman bare metal is, is something that is not like super common, maybe because, like most users will be using public cloud providers or some sort of commercial virtualization solution that is already available.

A

So I don't know for research institutions that want to have a bunch of notes and want to to get something up and running. Maybe there's there's something we can provide uh with some ideas.

B

Or pointers, I guess we don't even have to have a recipe for it and say do this we, but we can say we as a collective, have done these things yeah. We know that they work well, it can can be made to work, uh but yeah, that's good. She didn't mention that in this presentation, all of our computers is on-prem yeah. I suppose anyone using a cloud provider can also just use bare metal through through whatever they support as well. I think they will all do now.

G

I think that'd be very valuable for the university community from what I can tell most of the uh the bare metal clusters are hand created from various various methods. So uh what works and what works well would be definitely useful.

B

Yeah, I think, as soon as you do it any kind of scale, then the hand, crank method, just sort of doesn't work, spend your whole time doing it.

A

I I guess the the dream is really this idea of the cluster api, where you you put some effort into the bootstrap cluster that you do by hand, but then everything else is kind of coming automatically via the cluster api. I don't know how far it gets, because then you still need this kind of metal as a service component somewhere.

B

Yeah, there's there's enough bits of surrounding infrastructure. You need still yeah would have to be set up by something, but maybe some of this will become more ubiquitous as time goes on. I don't know. Yeah.

A

But yeah, but so maybe maybe we take this as an action just to to send around uh like a survey like we did the last couple of times with uh asking specifically about parameter deployments. Yeah.

B

That'd be good, also be for those going to kubecon. Actually to do some research, they will be really valuable.

A

Yeah yeah, actually we we don't have a talk this time for the group. I don't know if there are other questions on the topic. Let me see if there's any I'll, give probably got time for one more. If there is one more otherwise I'll stop sharing.

A

I think we're good five.

G

Second, rule.

A

G

A

All right thanks a lot again, jamie and scott. That was pretty interesting.

A

We don't have anything else today, so, but one one thing I was going to mention is uh for kubecon. We still have another session in two weeks, but we.

G

A

uh Talk this time, but we should probably just circulate uh like a slot lunch time or something where we all get together and yes, a it all started in barcelona. So we might.

B

As well get together, I'm yeah, I'm I'm not actually going I'm a bit gutted. I've got three people from my team are going to be there, though so I'll um make sure I send them your way.

A

Still escaping uh this jam session, I can see, I know, I'm sorry.

B

I'm saving up for detroit.

A

Oh, that sounds good.

A

All right- that's um I I don't have anything else for today and if anyone else wants to raise something.

B

A

All right, otherwise, we have a container sage uh in two weeks um and after that could come so yeah thanks everyone for attending and we'll follow up also in the in this live channel. Thank you. Everybody great to see.

G

A

Yeah, thank you.

H