Description
During this episode we're going to discuss two of the most common questions we receive: "How large should my nodes be?" and "How many pods can I fit in my cluster?" We'll look at how to determine node sizes, how node size affects the architecture, and what considerations need to be made for cluster sizing during this can't-miss episode!
A: Good morning, good afternoon, good evening, and welcome to another episode of the OpenShift Administrator Office Hours. I'm Chris Short, executive producer of OpenShift.tv and principal technical marketing manager here at Red Hat, joined by my teammate Andrew Sullivan. Welcome, Andrew. You just got off a customer call, how's it going?
B: I am, I am happy to be here, as always. Yeah, it's been a crowded morning.
B: So, apologies to all of the people who are watching and listening for my tardiness. I was on a call that ran, unfortunately, a couple of minutes late, so I do appreciate you sticking around and waiting for us to start. It means a lot to me. It makes me happy on this cold, rainy, maybe snowy Wednesday here in...
A: Here in North Carolina it is actually very sunny and cold today. Like, I woke up and it was 16 degrees Fahrenheit, which is several degrees below zero Celsius. So yeah, but it has now warmed up to a beautiful negative six Celsius, which is 21 Fahrenheit. Yeah, good stuff.
B: It's already causing, you know, massive amounts of panic. I haven't been out of the house in like a month and a half. Wow. Thank you, COVID. Aside from, you know, taking the dog outside and going for a walk and stuff like that. So I can only imagine that the grocery stores right now are completely out of bread and milk, because, you know, milk sandwiches.
B: So, all right. Due to my tardiness, I won't waste any more time with the small talk that you and I can do at any time. You know, right, we do work on the same team after all. Yes, indeed. So I don't think that there were any follow-ups from last week. I don't recall any; I know there probably was something that I am forgetting. I'm trying to get better about...
B: ...notes to myself to actually cover those things. But yeah, it was funny: I did an ask-me-anything session for our field yesterday and came away with almost a dozen questions where it was, wow, I don't know the answer to that, I'll have to follow up, so let me get back to you on that one.
B: Which is also interesting: Christian, who is supposed to be our guest today, is in today's session, today's version of the same thing I did yesterday, so we'll see how it goes for him. And I did just see a Slack pop up right underneath your name here, from Christian, that said "Andrew was right."
B: So, new things this week. I specifically wanted to call this out in case, like me, you filter emails from Red Hat, which I know sounds strange as an employee, but, you know, we have a bunch of distro lists and one of the noisiest ones is the errata and security one, because every time we release an errata I get an email about it, even if it's a product I have no interest in. So, very importantly, we announced a CVE this year. Yes. So I'm going to, you know, I...
B: Yeah, so I know, you know, this is the OpenShift Administrator Office Hour, sure, and at first glance you think, well, sudo? OpenShift? CoreOS? So, CoreOS does have sudo in it. And yes, even though everything is deployed as a container, we still rely on the security mechanisms of the underlying OS to be intact, right? If you deploy a FIPS cluster, it relies on, for example, those FIPS encryption libraries at the OS level to provide that functionality.
B: I wanted to make sure that our audience here is aware of this and is paying attention. Thank you. So, a couple of things: make sure that your container images are up to date, right? If you're using UBI, if you're using a RHEL image, you know, make sure that you update those regularly, and as continuously as possible.
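To make that concrete, here is a minimal sketch of rebuilding an image on the latest UBI base so the rebuild picks up the patched sudo; the image name and tag are placeholders, not anything from the episode:

```
# Pull the latest UBI 8 base so the rebuild starts from current packages
podman pull registry.access.redhat.com/ubi8/ubi:latest

# Rebuild on top of it; 'dnf -y update' pulls in released errata, sudo included
podman build -t myapp:patched -f - . <<'EOF'
FROM registry.access.redhat.com/ubi8/ubi:latest
RUN dnf -y update && dnf clean all
EOF
```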
B: I strongly suspect that we will see an update for CoreOS, or an update for OpenShift which will include an update for CoreOS, to address this, but I do not know the time frame for that. I...
A: I do not either. I know that there's a 4.6.13 release in the fast channel right now; I don't know if that includes the sudo patch or...
B
Not
it,
it
does
not
so
as
soon
as
I
can
remember,.
B: You go to the Cluster Settings tab inside of your cluster, so you can click on the fast channel and...
B: So, typically we release these updates on a bi-weekly cadence, every other week, every two weeks. That would mean that 4.6.14 would ship, Andrew's guess, probably a week from this Monday, a week and a half from today, right. And there are also nightly releases; I have not looked at those to see if it's been addressed inside of there, and I'm not sure anybody would want to run a nightly release anyway.
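If you would rather check from a terminal than the Cluster Settings page, the equivalent stock commands are:

```
# Show the cluster's current version and update status
oc get clusterversion

# Show the channel and the updates currently available in it
oc adm upgrade

# Example of switching the cluster to the 4.6 fast channel
oc patch clusterversion version --type merge -p '{"spec":{"channel":"fast-4.6"}}'
```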
B: There are some, and I scrolled past it up here, for RHEL nodes there are some workarounds and mitigating factors that you can apply. I haven't evaluated these to determine whether or not they are suitable for CoreOS, but at a minimum, please be aware, please keep an eye on the information and, of course, keep an eye out for an update to OpenShift, which will include an update to CoreOS, to address the security vulnerability.
A: Well, thank you for covering that. The biggest thing right now is: if you have anything public-facing with sudo in it, make sure it is patched and updated, including container images. So please make sure you do that. Yes, post-haste, as they would say. Yes.
B: Very, very important. So, cool. I've got two other things that have popped up, strangely, more than once in the last week, and these are very random things sometimes. So Monday there was an internal thread that was basically asking: how does a CoreOS node get its host name?
B: The DHCP host was handing out host names, so when it DHCP'd, does...
B: Yeah, so the core of it... and actually, I forgot to grab the link to the code where I had this up. Let me stop sharing and see if I can grab that, because it's on the same...
B: Yeah, so the core of what we're talking about here is that there's a service inside of CoreOS that runs on startup and basically uses hostnamectl to set the host name, nice, and it pulls that host name from /proc/sys/kernel or wherever inside of there, yeah. So the question is: how does that value get assigned, or how is that value assigned?
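On a node, you can inspect what that startup service ended up setting with a couple of generic commands:

```
# Static, transient, and pretty host names as systemd sees them
hostnamectl status

# The kernel's current host name, the value being discussed here
cat /proc/sys/kernel/hostname
```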
B: And there's a bit of a hierarchy here. So if I'm doing IPI and I use IPI to deploy new virtual machines via a machine set, what's going to happen is the machine set and machine API will create a new VM, you know, hostname cluster-randomness-worker-12, right. That is the name of the VM in vCenter, and when the VM powers on, VMware Tools is used to determine that name and feed it the first host name that it uses. Interesting.
B: So, for example, if I am doing DHCP and not doing dynamic DNS updates, so CoreOS...
That name will take hold, right, so that will override it. So why is this important? A couple of different reasons. The biggest one, particularly with IPI, is that it determines which CSRs to auto-approve based on that host name.
B: So when the node is created, right, machine API creates a node and gives it the host name; it comes up, pulls down its Ignition config, and goes through the initial configuration stuff. Then, when it comes time to join the cluster, there's an operator that says: there's a CSR for a node named xyz, and I see a node trying to join, or rather, I created a node named xyz; those two match, so I'm going to approve the CSR. Nice. If they don't match, then it's going to say: I don't know who this is, I'm not approving that.
B: You know, you can go in and manually approve those CSRs, right. That would bring the node in; that would get it up and running. The long-term ramifications of that are not known to me. You know, every time you do a certificate refresh, I don't know if it's going to require manual approval, right. Remember, OpenShift will automatically renew certificates for nodes that it knows; because that one has a mismatch, you know, does that apply? I don't know.
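The manual approval being described is the standard CSR workflow; for example:

```
# List certificate signing requests; ones with no decision show as Pending
oc get csr

# Approve a specific pending CSR by name (csr-xxxxx is a placeholder)
oc adm certificate approve csr-xxxxx
```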
B: And, of course, node auto scaling, where the whole point is that I want it to do it by itself, I don't want to have to go in and tell it to do it, would now require manual intervention, right. If at 3 a.m., when everything's going haywire and it's trying to scale up from one node to 400 nodes, now you've got to get out of bed and come into the office and approve all of those things, instead of letting OpenShift do its job.
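For context, node autoscaling in OpenShift is configured with a ClusterAutoscaler plus a MachineAutoscaler per machine set; a minimal sketch, where the machine set name and the limits are placeholder values:

```
# Create a cluster-wide autoscaler plus a per-machineset autoscaler
cat <<'EOF' | oc apply -f -
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 20
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-1a
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-us-east-1a
EOF
```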
B: All right, so I will paste this in here, and find our Twitch here and paste the link in there as well. So this is, as we can see here, or maybe you can see, this is probably a little small: I'm in the machine-config-operator GitHub repo, specifically the 4.6 release, and I'm underneath the templates common base units directory.
B: So the way that this works is that the Machine Config Operator includes a number of files, a number of things, based on the infrastructure that you're using and the deployment type that you're using. "Common" and "base" effectively mean that it's going to be included in every node, regardless of the infrastructure. So, real quickly, if we go to, for example, worker and select, you know, I'll do the 00 one: if I'm deploying to vSphere, it's going to include these.
B: So if I'm doing a worker deployment to vSphere, right, it'll include this mDNS piece, and basically it determines: if this is an IPI deployment, then it'll output this data as a part of that. So this is kind of how we do specific actions during the install process, or during the node stand-up process, based off of the infrastructure and the other things that are happening there.
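You can see the result of that template selection on a running cluster with the stock commands:

```
# List the MachineConfigs the operator assembled for each pool
oc get machineconfig

# Inspect the rendered worker config to see which units and files made it in
# (rendered-worker-<hash> is a placeholder for the actual rendered name)
oc get machineconfig rendered-worker-<hash> -o yaml
```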
B: So the question is: how does this get populated? Whatever populates that, or rather the last thing to populate it, becomes what ultimately sets the host name that is returned back for this particular host. So again, that could be something early on through VMware Tools setting the host name, or it could be DHCP.
A: First question: do we have any plans to develop a PowerShell module to administer OpenShift? You're a PowerShell guy, yeah?
B: I wish. So, I am a PowerShell guy. I've been a PowerShell guy since the very first versions of PowerCLI with VMware; I was one of the co-creators of NetApp's PowerShell cmdlets and one of the advocates over there for them. So I do, I am a PowerShell guy, and unfortunately, as far as I know, there is no intention of doing that. I have done a little bit of research, you know, in all my copious amounts of spare time, of...
B: ...could I take, you know, the Kubernetes API, the OpenShift API, and effectively auto-generate some PowerShell cmdlets, PowerShell modules, based off of that? But I haven't actually tested or tried that. I think there are some community-based Kubernetes modules for PowerShell; again, I haven't had the time or the opportunity to check those out and test those.
A: ...probably not going to happen, but entirely possible to make.
B: Yeah, I'll poke around, I'll see if I can poke around and find a Jira issue on that.
B: We did finish, so yes and no. Update manager was where we found that it was not fully released yet, but for OLM and the rest of it, all of that is functional and should work as expected, as far as I know, and I tested it in my lab. I think we do have a livestream with Christian, and maybe myself, where we spent the entire hour or two hours covering that.
B: So I'll dig that up as well. You'll notice I'm taking notes so I don't forget to include these things in the show notes. If you didn't see, last Friday I think we published the show notes blog post on the openshift.com blog, so any of the links and other things that we used last week you can find inside of those blog posts, and this week will be the same. I don't know, I haven't talked with Alex.
A: No, it's a very cool thing you're doing, and I greatly appreciate it, as it probably helps more people, and as we do them more and more it will help even more people as we go. So yeah, that'll be super cool, so look for follow-up in the OpenShift blog, which I just linked to. Next question, and I'm gonna try and say this one, because Rapscallion Reeves is just something that rolls right off the tongue: is there a way to control which node gets deployed onto which oVirt host?
B: Yeah, so real quick: I see it was Killer Goalie who was asking about disconnected OLM, yeah? Sorry, if there are things that you would like to see or that are missing, please reach out, just let me know, andrew.sullivan@redhat.com, and we'll be sure to specifically cover that, you know, next week, if you let me know.
B: So, controlling node placement, right. This will apply, as far as I know, to all of the IPI deployments, and it is essentially this: there is no specific mechanism to control where in the cluster a particular virtual machine lands, whether it's RHV, whether it's vSphere, whether it's OpenStack, etc.
B: So you can go in after the fact and apply those rules: you know, create an affinity group for the hosts and an affinity group for the virtual machines, and it'll manage them that way. Actually, now that I think about it, I wonder if you could assign a group to the template that it uses and...
B: Okay, cool. So this is my Red Hat Virtualization Manager environment. You can see I just went to the cluster, my cluster name here, and I'm looking at affinity groups. So I can create an affinity group and assign VMs to hosts, and I can create these rules at the same time on virtual machines, and I'm just going to edit one of these virtual machines.
B: With OpenShift, or excuse me, Red Hat Virtualization 4.4, I can specify affinity groups and labels directly in the machine definition. So what I'm thinking out loud, having not tested this at all, is: I wonder if you could, for the template that's used with IPI, basically specify this information so that any VMs created from it automatically inherit it. For master nodes, because they're not dynamically provisioned, you would effectively...
B: ...go in and assign this information as a day-two operation through RHV Manager to assign them to the specific hosts, and then for the worker nodes, have each worker node machine set's template specify whatever that affinity information is. If you happen to try that out, please let me know whether or not it works.
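As a starting point for anyone who does try it, this is roughly where that information would live. This is untested, as said above, and the affinityGroupsNames field is an assumption to verify against the oVirt provider spec in your versions:

```
# Untested sketch: merge an assumed 'affinityGroupsNames' field into the
# oVirt providerSpec of a worker MachineSet (all names here are placeholders)
oc -n openshift-machine-api patch machineset mycluster-worker-0 --type merge \
  -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"affinityGroupsNames":["worker-anti-affinity"]}}}}}}'
```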
A: What is the way to calculate sizing based on knowing how many pods we are creating? And I mentioned to them, right, like, yeah, you know, it's very dependent upon the needs of those pods. But if you know, say, that you have 500 pods, is there like a magic number for the number of worker nodes?
B: ...which resulted in me also doing a presentation for IBM Fast Start around the same topic. Oh, lucky you. So yeah, it's a fireside chat; I think I was asking you about putting a fireplace in behind me, you know, like you've got...
B: So it really comes down to, okay, it comes down to a couple of different things, and by "a couple" I mean it varies based off of what you're doing. So first I want to explain two terms: characterized and uncharacterized.
B: I haven't read this page in enough detail or asked that question, so we may just need to double-check on that. But we want to look at, importantly, the maximum number of pods per node, and then whether or not there are any size restrictions or limits. So, for example, continuing on down the page here, you can see the AWS instance sizes that we test with, right, things like how much CPU, how much RAM, so on and so forth. This is not the list of the only supported instance types.
B: These are just the ones that we test with. So essentially, what I'm trying to discover here is: is there anything that would artificially limit or change the number of nodes or the number of pods that I have in my cluster, right? If I've got a pod that needs half a terabyte of RAM, right, needs half a terabyte of RAM, that can pretty dramatically change how I size my nodes and how I interact with my cluster.
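On the earlier point about the maximum number of pods per node: that per-node limit can be raised with a KubeletConfig. A minimal sketch, assuming a machine config pool already labeled custom-kubelet: large-pods (the label and the 500 value are examples, not from the episode):

```
# Raise the kubelet's pod limit on nodes in a labeled machine config pool
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: 500
EOF
```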
B: So let's assume, in my first example, they're one CPU and two gigabytes of RAM each. It's pretty straightforward, right: 500 pods easily fits within, you know, a reasonable node size, even though we wouldn't want to have just one node, for availability purposes, etc. So now we can do kind of a mental exercise: what happens if I have two nodes? Effectively, I will have two nodes, each one being equally sized, so 256 CPUs and 500 gigabytes of RAM, from an application perspective.
B: So I now really have to have two nodes, and each one is capable of hosting the entire workload at any one point in time. So let's expand it out: three nodes, four nodes, five nodes, eight nodes, ten nodes, twelve nodes. Effectively, what you're trying to do here is figure out what's the right balance of distributing the workload across the nodes in your cluster for maximum performance, maximum availability, and maximum flexibility.
B: Flexibility here is an interesting one, and one that is quite subjective. So flexibility here could be: well, I'm only ever going to take one node down for updates at a time, so the other nodes only need to have enough spare capacity, extra capacity, to accommodate that. It could also be failure domain.
B: My failure domain: maybe I'm running in a physical data center, you know, on-premises. Maybe it's running in, I'll pick on RHV, right: I've got four massive RHV nodes, each one is, I don't know, eight terabytes of RAM and 500 CPUs, and I could easily fit, you know, 30 of my OpenShift nodes onto those four hosts.
B: Okay, but what's the failure domain? Because now, if I have one physical node that has, you know, 10 virtual nodes, I haven't solved that problem. I have to be able to accommodate that amount of infrastructure failing at any point in time. So we have to be aware of those things; we have to work with our underlying infrastructure, our underlying service provider...
B: ...if you will, to understand what's happening there and be able to accommodate that at the infrastructure level. We also, from an application perspective, want to be aware of what those failure domains are. If the application is architected so that there's a single pod that is a single point of failure, that could be bad, right, and then none of this planning around failure domains, et cetera, is going to be particularly useful. So basically, in a nutshell, over the last six minutes we've talked about workload sizing, but workload...
B: So node sizing also has to accommodate not just the workload but the other things that are happening. So what are the other things that are happening? The kubelet itself, you know, the other kinds of services, so think things like CSI.
B: So if we scroll down here, and I'll post this link as soon as I make sure it's the right one, yeah, here: platform tested cluster maximums. This is the link that I just posted a minute ago. As of 4.6, half of a CPU, 500 millicores, is reserved for the system, compared to 3.11 and previous versions.
B: So if you expect those system-level, right, OpenShift functions or services to consume more than half of a CPU, you need to take that into account.
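You can see exactly what a node holds back by comparing its capacity and allocatable resources; a stock command:

```
# Compare a node's raw capacity to what pods may actually use; the gap is
# what's reserved for the system and kubelet (<node-name> is a placeholder)
oc describe node <node-name> | grep -A 6 -E 'Capacity|Allocatable'
```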
B: In particular, metrics, Prometheus, can be a huge consumer of CPU and memory on the host. Now, when does that happen? The more pods, the more containers we have running on that host, the more effort, CPU, and memory Prometheus is going to have to put in to collect all of those metrics and then serve them back up to the metrics service.
B: So it becomes a little bit of a self-fulfilling thing, or what we used to call a "traffic trombone," if you've ever heard that term on the networking side: the more pods I put on the host, the more non-application resources I need on the node to accommodate the other things that are happening.
B: Don't discount things like network and storage traffic as well, especially if you're using iSCSI PVCs and other things that are known to consume CPU resources at high throughput. Say I've got 40 gig going into my servers, and I've got all of these pods with a bunch of iSCSI PVCs, and, you know, they're pushing 30 gigabits of traffic: that's a lot of CPU going into processing those packets and doing the things it needs to do.
B: We just need to be aware of that, plan for that, and accommodate all of that type of traffic. And of course, it's okay if you don't get it right the first time; there's nothing wrong with that. This is the beauty of OpenShift, the beauty of Kubernetes, right: we can add nodes at any point in time, so we can kind of temporarily scale out with bigger nodes and then go back and remove the smaller nodes.
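That swap is just machine set arithmetic; a sketch, with placeholder machine set and node names:

```
# Scale a machine set with larger instances up...
oc -n openshift-machine-api scale machineset bigger-workers --replicas=3

# ...then drain and scale the old, smaller set away
oc adm drain small-worker-node-1 --ignore-daemonsets
oc -n openshift-machine-api scale machineset smaller-workers --replicas=0
```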
A: So, all right, cool. I tried to explain to Islam how to do the math, essentially, for his worker nodes, right. So he has two worker nodes and wants to put the workloads on there. Those worker nodes have a baseline of system requirements, right, and then your workloads have their requirements.
B: I just wanted to add, on that excess, you said, you know, 20 percent of extra overhead: that number is dependent on, in Andrew's opinion, two things. One, burst capacity for things like node failure, right, and two, burst capacity for things like, you know, the Slashdot effect or the Reddit effect, or something like a Super Bowl ad.
B: You just had this huge burst of traffic, and how can I accommodate that? How to actually determine that number is based on, again, my perspective, your ability to react to that scenario, right. What do I mean by that? If you can react, right, you know, auto scaling will take effect, and if it takes four minutes for me to get a new node up and operational and joined to the cluster and ready to accept workload, then you need enough capacity to accommodate four minutes of burst, right.
B: Well, if I'm growing at x bytes and it takes me six months to get new hard drives in, but I'm going to have an issue in four days based on my alert threshold, that's not going to work, right. I have to have this balance of how quickly I can add capacity, and then work backwards from there to determine what my alarm threshold should be.
B: To be clear, that's the bare metal installation method, not just, right, physical servers. Yeah, so the minimums for compact clusters are effectively the combination of control plane and worker node minimums. The bare minimum for a control plane node is 4 CPUs and 16 gigabytes of RAM, and that is, if we go to Installing and we go to, excuse me, Bare Metal...
B: ...you know, the storage drive for those compact nodes. But note that that doesn't include any workload, so however much application capacity you need, add it on top of that. Now, that's two CPUs and eight gigabytes of RAM here; I think it's safe to always build on top of that, because you're going to have the metrics service, you're going to have those other things that are deployed inside of there that are consuming resources.
A: Cool, makes sense, and thank you for that. JP Dave says he got his 4.6.9 problems figured out. It was a CSR issue; it looks like one of the nodes wasn't joined or didn't have its certificates approved. So that's good.
A: Yes, yes, yeah. So JBJ says, when in doubt, do an oc get csr. Yeah, good point; that is a very common troubleshooting step that I use: is everything issued, right?
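For that "when in doubt" step, a commonly used variant lists only the pending CSRs and approves them in one pass:

```
# Approve every CSR that has no status yet (i.e., is still pending)
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve
```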
A: So, the three-node compact cluster: you mentioned the bare metal installation.
B: So, this is... Andrew has issues that I know my team is well aware of; we've raised these. We overload terms when we talk about this.
B: So IPI, installer-provisioned infrastructure, is also called full-stack automation; UPI is also called user-provisioned infrastructure or pre-existing infrastructure. Those are fine, those are great, right; we understand that there are those installation methods for all the various platforms. Bare metal is where it gets confusing.
B: So, when you see me, especially in written communication, I will refer to what the documentation calls, and let me scroll up here, what the installation docs call a bare metal install, including basically all of these, this entire subset of installing on bare metal. I call these non-integrated installs.
B: So if you're deploying to vSphere and you use the bare metal method, or platform: none in the install-config, then essentially it's saying: I don't know that this is vSphere, I don't care that it's vSphere, I have no integration with vSphere whatsoever. So you can't use things like the dynamic storage provisioner, right; you can't use things like NSX, right; all that other stuff. It is infrastructure-agnostic, right.
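For reference, that choice is made in install-config.yaml. A minimal skeleton of the relevant stanza, with placeholder names (a real file needs more fields, such as the pull secret and networking):

```
# Minimal install-config.yaml skeleton for a non-integrated install;
# 'platform: none' tells the installer to assume nothing about the infrastructure
cat <<'EOF' > install-config.yaml
apiVersion: v1
baseDomain: example.com
metadata:
  name: mycluster
platform:
  none: {}
EOF
```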
B: So this is the installation method that you want to use with physical servers that are not using bare metal IPI, right. It is also the installation method that you want to use when you are doing a mixed-infrastructure deployment: some nodes are virtual machines in vSphere, some nodes are physical servers, right. I can't mix those infrastructure types otherwise, and that is a Kubernetes limitation, not an OpenShift limitation, right. And hilariously, because this comes up, I have this GitHub issue bookmarked. I love it.
B: I just posted the GitHub issue into the chat. So that's the GitHub issue for Kubernetes that prevents us from mixing infrastructure types in a cluster.
A: Damn, it's still open too, yeah, and it has been for a while. Yeah: lifecycle frozen, milestone 1.19. Okay, and 1.21 is being worked on right now. All right, so share this one out and get some more eyeballs on it.
B: All right, so I see a question: is there virtual RAM, do we support swap space? So this is a yes and a no. Generally, right, Kubernetes always recommends that you disable swap space; if you've ever installed a cluster with, like, kubeadm or something like that, it'll say swap space is not disabled, and it will force you to disable it before you continue, right. So why is this important? Because, yes, technically you can turn on swap; you can turn on all of those other things.
B: OpenShift Virtualization has brought this to light, you know: do I want to turn on things like kernel same-page merging, right, KSM, to help consolidate and get more overcommitment of those resources?
B: So this is a choice that you have to make, but I can tell you why it is strongly discouraged in the Kubernetes community, and that's because Kubernetes doesn't know when those resources are being overcommitted. So, for example, my host has 16 gigabytes of RAM and it's struggling, it's hurting, right, it's swapping.
B: It's sending memory pages to swap and application performance is just suffering, but with it using that swap space, Kubernetes isn't aware of it, so it just looks at it and says: oh, your memory pressure looks fine, you're at, like, 80 or 85 percent memory pressure. So it'll keep assigning workloads to it, which just exacerbates the situation, right; it keeps getting new pods. It's masking this underlying resource contention issue, right. So you want to be very careful any time you use something like swap or other resource overcommitment technologies on your nodes.
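Checking for this on a node is quick; generic Linux commands:

```
# Verify swap is off on the node; both commands should show no swap in use
free -h
swapon --show

# Disable it immediately if found (persist by removing swap from /etc/fstab)
swapoff -a
```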
B: Basically, if the hypervisor, you know, you've got your hypervisor node that's running your Kubernetes nodes, and the hypervisor is way overcommitted: it's swapping, or it's having to, you know... vSphere has CPU ready time, right, when a VM is waiting for time and can't get time on the CPU. So the vSphere host, or the hypervisor, is really hurting, and Kubernetes doesn't know that, right. So it's saying: I need to auto scale, I need to auto scale.
B: I need to auto scale, because the application, you know, maybe whatever metrics you've got set up on the application, is saying everything's slow, I need to scale up, I need to make my application perform better. So you end up with this kind of whirlpool, right, this circling drain: the application needs to scale up, it's adding more resources, the underlying hypervisors can't accommodate those, it's just making everything worse, and it just leads to disaster.
B: The recommendation I always make is: if you must use overcommitment, make that overcommitment happen as close as possible to the application. What that means is, whichever scheduler is closest to the application, in this instance Kubernetes, OpenShift, let it handle that overcommitment; don't do it at both the hypervisor and Kubernetes, don't do it at multiple layers, so on and so forth, right.
A: Cool. So we are approaching the top of the hour. Let's see, so somebody dropped "OCP vSphere UPI automation": the project is easier to use than the IPI way, is what our friend...
B: Yeah, there are scenarios. So, for example, OpenShift, I think it was 4.3 or 4.4, we introduced automatic certificate renewal. If you used an early 4.x version, you remember: you'd deploy the cluster, and then within 24 hours it would rotate the certificate, and if you shut down the cluster within that 24-hour period, before it rotated the certificate, and then turned it back on after that period of time...
B: ...everything would just be in chaos, and it was this long, complicated process of going in and reissuing and reapproving certificates and getting everything back up and running. So we fixed that; it now does automatic certificate rotation and approval and all those other things, except when the cluster's been down for a very long time, I'm talking weeks. Sometimes you will need to go in and just reapprove those CSRs, even with IPI, yeah. And reapproving those CSRs, basically to get the nodes joined again, resets that whole process, and then it'll do it itself.
A: Yeah, it used to be a bear, and now it's a little bit easier, you're right. So, there's a lot of chat, so sorry if I missed something. There's one question: did we answer the one about scale up versus scale out?
A: We didn't answer that yet, right. So, okay: would you say it's better to scale up worker nodes or scale out, vertically versus horizontally? Scale up makes more sense to me, but scale out means I don't have any configuration changes, right; it's just adding another node, for example. And, you know, my answer to that was kind of like, it really depends on what you're...
A: ...right, like, what is faster in your instance, right? If you're on AWS, then changing, you know, like, RAM is pretty, you know, interesting, yeah, and just quick sometimes, right. But your system has to be able to acknowledge...
B: ...you know, node failure, for that burstiness, or that burst of new workload in the event of a node failure. Awesome. So initially, maybe it makes sense to do scale out instead of scale up. On the other hand, you know, if your application has fundamentally changed, you know, hey, we thought the largest pods were gonna have to accommodate...
B: ...were, you know, and I know they sound an awful lot like VM sizes, because sometimes they are an awful lot like VM sizes, you know, two CPUs and eight gigabytes of RAM. But really, you know, after running for a couple of months, the app guys figured out that we really need 16 gigabytes of RAM. That can dramatically change your strategy of, hey, I still want to keep x number of instances of the application per node; so now I need to scale up, scale vertically, to keep my ratio in check.
B: So that's one thing that I have not discussed at all, and this is a concept that was introduced to me in the storage world: it's called stranded resources. Wow. So with storage, we hear of stranded resources when I have an IOPS-to-gigabytes mismatch, right. Spinning media is a really good example: I can have 10 terabytes on a single hard drive, but that hard drive can only deliver 100 IOPS.
B: I can't use it all, because I don't have the IOPS to deliver that, right. And flash media, SSD, and especially NVMe, almost have the inverse problem, and this is why, you know, particularly storage vendors that do deduplication, compression, and stuff like that see a big benefit from flash media: it concentrates IOPS onto that media, and that media has a much higher IOPS-per-gigabyte density, so packing more onto it is beneficial for the media. The same thing is true with virtualization, with Kubernetes, with OpenShift.
B: So, you know, when I'm creating my nodes: say the workload wants a one-to-four CPU-to-RAM ratio. If I deploy a virtual machine that has eight CPUs and maybe 48 gigabytes of RAM, right, that ratio is off, right. At one to four, eight CPUs pairs with 32 gigabytes of RAM, since eight times four is 32, so I would want a...
B: ...one-to-four, eight-CPU, 32-gigabytes-of-RAM virtual machine, virtual node, in my OpenShift cluster to effectively accommodate that workload. With 48 gigabytes of RAM, I've consumed all of my CPU but now have an extra, what was that, 16 gigs of RAM that's basically inaccessible as a result. So you want to be cognizant of those ratios and keep them balanced so that you don't strand resources accidentally, right.
B: So I think it's gonna depend on the storage type, right, for one. So, talking OCS: OCS does distribute data across the nodes, right. I did, and I have the link up here. I know everybody's still looking at my shared screen, so you get a lovely picture of Chris and I, hey Chris, talking and me not paying attention.
B: I call that a Wednesday. So, we did talk about storage, or sizing storage for your nodes, so I'll include a link, I'll post it here in the chat real quick, but I'll also include a link to that in the show notes, if you want to go back and listen to that episode where we talked about sizing the disks that are used by OpenShift nodes, to maybe help cover that.
A: Awesome, cool. So we are at the top of the hour; we have another show coming up here in 30 seconds. The OpenShift Commons briefing is going to include the team at Kong. If you're... or wait, nope, yes, Kong. If you're familiar with the folks at Kong, they have the big gorilla logo. So yeah, we're gonna be jumping to that here in a few seconds. So thank you all for joining, thank you for your questions. Andrew, go back through the Discord chat and see, yeah...
B: And please, if you've got a thing for Discord, please feel free to join the Discord and ask questions at any time. Also, if you have a question that we didn't get answered today, follow up on social media, @practicalAndrew on Twitter, or andrew.sullivan@redhat.com.