From YouTube: Kubernetes UG VMware 20220203
Description
February 3, 2022 meeting of the Kubernetes VMware User Group with discussion of recent updates and patches, resource limit declaration and admission controllers, monitoring of resource usage.
A
On the agenda today, I added coverage of recent updates to software, which includes both the VMware infrastructure products like vSphere, as well as the open source Kubernetes pieces like the cloud provider and the CSI storage plugin.
A
We had an informal chat before the meeting started, and I think Miles will go on the agenda in a bit and talk about possibly amending our recurring meeting time for maybe better convenience to the people who tend to attend. I also noticed recently that there were a number of conversations scattered about Slack, not all in the user group channel; some of them were in the cloud provider channel, and I think some might have been in the Tanzu Community Edition one, but all related to mapping Kubernetes zone features to running on things like vSphere clusters.
A
Okay, I assume you can see the agenda notes document. Okay, so recent updates: both ESXi and vCenter were recently updated.
A
Vmware
did
conclude
an
investigation
and
found
that
at
least
a
couple
aspects
of
this
weren't
exploitable
in
vmware
products,
but
the
authoritative
links
are
there
and
there
are
a
lot
of
vmware
products,
but
this
one
is
pretty
much
just
talking
about
the
infrastructure
related
to
kubernetes.
Those
links
go
the
full
span.
So
go
look
over
there.
If
you're
interested
the
csi
storage
plug-in
was
updated.
In
mid-january,
there
were
features
added
for
those
still
ringing
on
vsphere,
6.7
and.
A
Go look at that link to go get it and read about it. I believe there were also features related to supporting upgrades from the old in-tree storage plug-in to the CSI version.
A
There were bug fixes if you're on CSI 2.3. For the cloud provider, the Helm chart version one was released, and there are docs over there on how to use it. The cloud provider itself had an update related to being able to configure IP subnets a little better. I know the person who asked for this feature in particular was using kube-vip and had asked for some enhancements related to being able to use it a little more flexibly, so that might be of interest to people. The alpha of version 1.23 was released in mid-January.
A
This, principally, would be useful if you want to move up to the newer version of Kubernetes. At the cloud provider meeting yesterday, the devs announced that they expect this release by February 11. As usual, that's just an estimate; it could come in a little earlier, it could be later because things might happen, but I just wanted to let you know what's going on. Then, Miles, if you want to bring up the subject of a new meeting time, you can go for it. I'll stop my screen share.
B
Sure. So whenever we originally formed this group, the Kubernetes VMware user group, maybe a year and a half ago or so now, maybe a year, I don't remember, we had a poll on what time slot to hold it in. At the time, a lot of the folks that were coming to the user group regularly were based on the West Coast in the US, and the vote came out to be, I think, 11 a.m. Pacific. Right, Steve? That's correct!
B
Okay! So in the last while it appears to be more and more predominantly EMEA-based folks, so maybe it makes sense for us to have another poll with some alternate time slots and see how that comes out, just for the folks that are regular attenders that are based outside the US. If it doesn't change, that's fine, but I figure it's worth having another poll, because it's been some time now since we asked.
A
That sounds good to me; I'd second that. Maybe, it suggests, you'd probably better add a few more selections, but we might as well ask the people on the call now for their suggested times. I don't think you want it free-form, so maybe just have an assortment out there.
B
Yeah, yeah. I mean, do people prefer, like, during work hours, or like morning time, early afternoon, or is after work usually the best?
D
B
A little earlier than 7 p.m., okay. But yeah, I'm not precious about it. If it works for everyone else, that's cool; I'm okay with it.
A
Okay then, let me see if anybody tacked on anything else in the agenda. No, that was it. We do have an opportunity in this user group to submit a maintainer track talk for KubeCon, and that will be due in around a week or so.
A
If
people
have
them
on
things,
they
might
like
to
see
at
the
kubecon
event,
we
could
maybe
have
the
event
system
and
add
a
second
thing
or
just
go
with
that,
but
just
putting
it
on
the
table
for
discussion.
A
Okay, I'll just mark that down as no objections and no other ideas, so we'll probably go forward with that. And I've talked to Michael, and he seemed amenable to attending physically to give a presentation on that subject, so that sounded pretty good to me too. For those who weren't aware, the conference is slated to be physical, in Spain, in mid-May.
A
No, it was Valencia.
C
That would be nice; it might make discussions here a bit easier. I'm sorry, I don't know. I'd kind of like to go, but things are still a bit wacky around COVID. And yeah, same here, and it's hard to justify going there at the moment within ITQ.
A
It seems to change on a week-by-week basis where I live. You could even alter your view of the world by news source and get a completely different impression of what's going on in your own city.
D
A
And then the other thing, which I didn't put in the agenda but, like I said, I don't know if the rest of you noticed it: there have been a fair number of questions recently, some of them maybe by newbies, but people trying to run Kubernetes on top of vSphere, taking into account clusters and even moving things around.
A
I mean, some of these people have been enabled by recent open source releases. Looking at the descriptions of the questions they're asking, I'm almost thinking they're more on home-lab-like environments rather than production-grade commercial ones. But still, we'd like to support all of these, because, particularly when you go out to edge locations, the fact is, even in enterprises, sometimes these edge locations end up looking closer to a home lab than a big data center.
A
So I think some of these issues are pretty generic to a lot of people, and I gathered that some of these people even had situations where they have multiple ESXi hosts but don't even potentially have shared storage readily available to them. My own attitude is, as long as you're not running persistent workloads on Kubernetes, it probably is just fine to not have any shared storage.
A
If you need to support a persistent workload, like a database or something, there might be hope for hosting that on Kubernetes if you use the right type of storage solution and maybe had three available physical nodes. But going down to two, I'm not so sure that you're not better off just running those persistent workloads in a VM, with vSphere features to attempt to achieve availability, rather than adding a Kubernetes layer on top of that mix.
A
B
Yeah, when it comes to edge, it gets really hairy, especially whenever you're talking three hosts or less, because you have no room for any maintenance whatsoever. So this is actually what Robert and Scott and I were talking about: you know, maybe doing some work on, or doing some kind of presentation on at some point, to do with storage and Kubernetes, how they work or, more critically, don't work together, why not to stretch storage, and yeah.
B
Basically, just why not to do things rather than why to do things. But yeah, whenever it comes to K8s on vSphere: the simpler, the better.
C
I think we could do a whole session just on storage, easily.
B
I have done lots of them.
C
So it would maybe be nice for, you know, Steve and anyone who's watching, probably me, Scott, Miles, and some other guy from ITQ.
C
That's also thinking about some of these problems. We're going to do a Tanzu Tuesday session where we continue that conversation about how to, yeah, kind of make these two worlds match up when it comes to resource and storage use. Thinking of resource use in particular, it's funny, because I speak to a lot of guys from infra backgrounds, and they're always concerned, and it's not Kubernetes related per se. It's that whenever other people are running high-workload, high-resource-use things in their ESX clusters...
C
They
get
a
bit
worried.
You
know
they
don't
like
the
black
boxiness
of
a
whole
bunch
of
vms,
pulling
a
huge
amount
of
resource
x
and
there's
this
kind
of
this
strange
contention
around
responsibility
or
a
feeling
of
responsibility.
C
I've yet to meet a vSphere admin, or a team responsible for vSphere, that was happy leaving an ESX cluster alone, just letting it go and saying: well, apparently today it's running at 90% RAM; we don't care, we're not responsible, whatever's running there is the problem of the people running it. Even if there are very well delineated agreements about who's responsible for what.
C
Because these VI admins have, for 15 years, been trained to not let that happen, yeah, and to take responsibility and to make sure that vSphere stays healthy, ESX stays healthy, all the stuff running on it stays healthy. And, of course, partly they have to, because if the cluster is going to go to 100% CPU, vSAN is going to fall over, and then all these things are going to fall over. So you...
D
...scheduler-wise, you just don't ever go over 2:1 in your ratio of vCPU to CPU and RAM to RAM, and then you're usually fine in those cases. But that's also a waste of resources, right? You don't...
C
D
There was one put up on OpenShift that mentioned this, and there was one on PKS, in 1.0 of PKS, that hasn't been updated since 1.0, about best practices of running on vSphere, and then just from the overall community, talking with people and seeing what organizations are doing and things like that. But there's nothing official out there that says this, because no one wants to come out and say: yeah, get half the resource utilization out of your environment.
D
Only
do
two
to
one
ratio,
doesn't
sound
good,
but
when
it's
written
down,
when
you
explain
it,
it
makes
sense
in
terms
of
how
the
scheduler
works
and
that
you
actually
have
over
provisioning
in
kubernetes
with
requests
and
limits.
If
you
set
them
correctly
and
things
like
that,
but
it's
the
double
level
of
over
provisioning,
that's
always
wrong.
I.
A
D
For sure. No, but it's the same issue that we had with datastores that are thin provisioned, where you thin provision your disks and your NetApp is thin provisioning volumes. It's like: great job, you have no idea how much is actually used, or how much is allocated, or what's going to happen tomorrow.
D
It's
this
double
level
of
management
has
always
been
an
issue,
and
it's
been
known
that
it's
an
issue
in
storage
and
there's
best
practices
kubernetes,
it's
just
not
official
that
there
are
best
practices
on
this,
but
it's
the
exact
same
issue.
Here,
it's
compute
there
is
storage,
but
it's
the
same
idea
over
provision
at
one
level.
B
I
think
it's
there
seems
to
be
an
element
of
have
your
cake
and
eat
it
too,
from
people
that
manage
a
vi
infrastructure
that
happens
to
have
kates
on
it
as
well,
though,
which
is
they
want
to
get
the
absolute
maximum
efficiency
out
of
the
platforms?
That
means
when
stuff
is
run
in
steady
state.
B
You
know
they're
running
80,
cpu
utilization
ram
utilization,
it
all
comes
along
nicely,
but
you've
got
absolutely
no
room
for
burst
or
you
know
workloads
rolling
out
like
if
you
suddenly,
you
know
nuka
cluster,
and
then
you
run
argo
cd
and
you
set
up
the
entire
stack
all
over
again.
It's
gonna
have
a
bad
time
doing
that
so
yeah
there's
there's.
Definitely
an
element
of
people
want
to
be
too
heavy
on
the
resource
consolidation.
Just
because
of
that's
how
you've
always
done
it,
and
it
was
really
easy
because
it
was
steady
state
before.
A
Yeah,
the
other
thing
that
could
put
you
over
the
edge
is,
of
course,
a
failure,
and
it
could
even
be
a
software
failure,
not
necessarily
hardware
but
I've.
I've
heard
horror
stories
of
people
running
cluster,
aware
persistent
apps
like
cassandra
that
go
across
clusters
and
you
can
host
those
on
kubernetes,
but
if
they
ever
trigger
themselves
to
think
they
need
a
rebuild
or
a
node
replacement.
A
Those
things
can
get
pretty
ugly,
often
evidence
this
hasn't
come
up
yet
but
store.
Networking
is
another
aspect
that
traditionally
in
on
the
vsphere
platform,
you
could
hear
mark
and
carve
out
resource
allocations
in
your
you
know
allocating
the
physical
network
bandwidth
to
particular
workloads
or
particular
designated
traffic,
like
storage
versus
whatever
compute
you're
hosting,
and
it
was
always
views
as
pretty
wise
not
to
have
a
big
burst
in
compute.
A
Take
out
your
underlying
storage,
because
that
could
lead
to
a
positive
feedback
loop,
and
I
think
that
layering
kubernetes
with
these
cluster
aware
persistent
apps
can
do
that
exact
same
thing.
Some
of
those
things
on
a
rebuild
can
really
get
ugly
like
demanding
1020x
the
normal
baseline
amount
of
network
connectivity.
C
I
also
think
there's
something
there's
something
new
here,
because
I
mean
the
the
fact
that
whatever
vms
are
doing,
you
know,
kill
your
clusters.
We've
known
this
for
a
while.
The
thing
is
the
chance
of
many
vms
all
doing
something.
At
the
same
time,
there
are
only
you
know,
a
bunch
of
scenarios
where
that
can
happen
either
you're
already
running
some
kind
of
distributed
workload.
C
Or
you're
having
like
some
kind
of
reboot
storm
or
you
know
the
the
virus
scanner
you
know
go
went
crazy
on
on.
You
know,
like
you
know,
a
thousand
horizon
vms
at
the
same
time
or
or
you
get
a
you
know,
some
kind
of
failure
which
causes
a
massive
hay
storm
we've
seen
that
before
so
or
drs
storms,
you
know
all
kinds
of
stuff,
but
the
kubernetes
I
mean
I've,
never
seen
I
mean
the
kinetic
is.
C
Is
this
this
amazing
control
plane
to
run
any
kind
of
distributed
architecture
on
top
of
and
the
chance
of
getting
these
kind
of
effects
are
so
much
larger
with
this
just
during
normal
operations,
because
the
whole
thing
is
a
distributed
model
and
everything
you
do
with
it.
Has
that
aspect
to
it.
B
If,
if
it
were
me
that
was
running
kubernetes
in
prod,
I
would
just
turn
them
off
on
vsphere
and
try
and
make
the
vsphere
layer
simply
run.
My
vms
leave
them
alone.
Let
them
sit
in
place
and
like
kubernetes,
do
everything
else,
because
you
you're
going
to
have
two
control
planes
fighting
with
each
other,
the
entire
time,
otherwise,.
C
Yeah, it's a conclusion I've reached very slowly, you know, over the last two years. I'm still quite new to this stuff, and this is just with TKGI, but it's the same: you reach the same conclusion, just start turning stuff off, because you don't need it; it gets in the way. But there's another aspect to this, and I saw a tweet come around, I think earlier this week, which was very interesting, which I think I retweeted. It was...
C
You
can
simplify
the
infrastructure
layer
and
then
do
kubernetes
bunch
of
clusters
and
you
leave
the
intelligence
to
that
control,
plane
and
hopefully
to
the
developers
deploying
apps
if
they
are
intelligently
deploying
apps
and
building
apps,
and
there
was
this
great
tweet.
It
was
like
the
top
five
things
that
most
kubernetes
developers
are
still
not
doing
properly
like
basic
stuff.
Give
you
know,
like
a
part,
you
know
a
limit
live
in
this
channel.
It's
resources,
yeah
start
basic
things
like
this.
Don't.
A
D
A
Yeah, well, I'm active in the Kubernetes IoT Edge working group, and I have to say that my observation there is that for people who grow up in a massively elastic public cloud, it's almost like the platform isn't actually going to coach you on putting these constraints there. It just assumes you sprawl out and gobble down a bunch of extra VMs and run up your bill.
A
People
who
evolve
using
kubernetes
in
that
environment
and
then
in
later
years,
trying
to
move
to
edge
with
having
gotten
away
with
never
declaring
any
resource
allocations
or
limits
are
in
for
a
pretty
bad
year
of
learning
the
hard
way
why
that
stuff
is
important,
and
I
think
that
that
yeah,
some
of
these
same
lessons
apply,
I
think,
when
you
potentially
move
to
vsphere
and
you're
paying
for
your
own
hosting
and
kind
of
have
a
finite
amount
of
capital.
A
You
you
want
to
invest
in
your
compute
resource
and
not
have
it
be
open-ended.
I
mean
once
you
go
on-prem,
it
isn't
a
public
cloud
where
it's
just
a
matter
of
getting
worrying
about
it.
When
the
bill
comes
a
month
later,
you
can
just
always
be
guaranteed.
You
get
what
you
want
more
or
less
instantly
and
yeah.
B
D
C
D
"What's wrong with the system?" I said: nothing, watch. And I bring up an nginx box, I set requests and limits, and everything is great. They say: well, why is that? I said: well, you're not setting requests and limits. They don't know there's a validating webhook in the back end that's doing this; all they see is their pod, and it's getting 10 megabytes and one millicore.
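For reference, Kubernetes can produce exactly this kind of tiny-default behavior without a custom webhook, via a LimitRange. This is just an illustrative sketch; the name and namespace are made up, and the one-millicore / 10 Mi values simply mirror the anecdote above:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: tiny-defaults        # hypothetical name
  namespace: team-sandbox    # hypothetical namespace
spec:
  limits:
    - type: Container
      # applied to any container that declares no requests of its own
      defaultRequest:
        cpu: 1m
        memory: 10Mi
      # applied to any container that declares no limits of its own
      default:
        cpu: 1m
        memory: 10Mi
```

With this in place, a pod created without requests and limits gets the tiny values injected at admission time, which is exactly the "why is my pod only getting 10 megabytes and one millicore" experience described.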
A

D
So if someone were to have a specific workload that needs more than what a normal workload should need, they talk to the platform team, who would add that into the exclusion list, allowing them a certain amount. But otherwise, for a standard workload: you want self-service, deploy to the cluster, awesome; you're limited to whatever is decided in that organization, based off the types of apps they need.
D
It's the easiest policy manager for Kubernetes by far, because it's all YAML-based, so you don't need to know a programming language. And the real benefit is that it has both the validating webhooks, like OPA (OPA just added mutating webhooks, so it can mutate requests as well), but Kyverno also has what's called generate policies, that allow, any time, let's say, a namespace is created, automatically creating these other objects.
D
That's what I use, and they actually have a way where you can generate an object, or you can create a clone of an object from another namespace, and it will keep them synced. So that's what I do for image pull secrets. I have an image pull secret in kube-system; for any new namespace that gets created, it creates it in that namespace and then automatically syncs it. So if I need to change that password, I just change it in the kube-system namespace and it gets replicated within two minutes to all other namespaces.
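The clone-and-sync behavior described above can be expressed as a Kyverno generate policy. This is a sketch based on the Kyverno ClusterPolicy schema of that era; the secret name `regcred` is a stand-in for whatever the image pull secret is actually called:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: sync-image-pull-secret
spec:
  rules:
    - name: clone-pull-secret-to-new-namespaces
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: v1
        kind: Secret
        name: regcred                                  # stand-in secret name
        namespace: "{{request.object.metadata.name}}"  # the newly created namespace
        synchronize: true   # keep the copy in sync with the source secret
        clone:
          namespace: kube-system
          name: regcred
```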
B

D

B
So, essentially, could you total up the number of resources from each of the nodes in the cluster, and then have an admission controller and an admission webhook that then says: as long as the resource that's being requested by this pod is less than what's left in the cluster, admit the workload?
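The check being proposed here is easy to state as code. Below is a minimal sketch of the decision logic only, not a webhook server; in the real thing the node allocatable figures and existing requests would come from the Kubernetes API. Units are millicores and bytes, and all names are made up:

```python
def admit(pod_request, node_allocatable, existing_requests):
    """Admit a pod only if its requested CPU and memory still fit in the
    cluster's total allocatable resources minus what is already requested."""
    def total(items, key):
        return sum(item.get(key, 0) for item in items)

    for key in ("cpu_m", "mem_bytes"):  # millicores, bytes
        remaining = total(node_allocatable, key) - total(existing_requests, key)
        if pod_request.get(key, 0) > remaining:
            return False
    return True

# Two 4-core / 8 GiB nodes, with 6 cores and 12 GiB already requested:
nodes = [{"cpu_m": 4000, "mem_bytes": 8 * 2**30}] * 2
used = [{"cpu_m": 6000, "mem_bytes": 12 * 2**30}]
print(admit({"cpu_m": 1000, "mem_bytes": 2**30}, nodes, used))  # True: fits
print(admit({"cpu_m": 3000, "mem_bytes": 2**30}, nodes, used))  # False: only 2000m left
```

As the discussion goes on to note, a check like this is inherently non-deterministic from the workload's point of view, since the answer depends on whatever else was already running.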
D
You should be able to do that with JMESPath in Kyverno as well, because it has API access to any Kubernetes object that you can pull in, and then use that and save it in variables and use it in next steps and things like that. So you could do that, would be my guess. I've never done anything like that, but you probably could.
A
B
Well, no, I mean, if it's an admission webhook, then I guess you could have it fail, or you could get it to default, like what Scott was saying, to some ludicrously low value, so it admits it, but then it won't start up, and then they'll have to go debug why it won't start up. So either of those would be okay, as far as it...
A
Just
strikes
me
that
it's
a
situation
where
it's
based
on
things
not
having
to
do
with
this
particular
workload.
You
know
what
was
already
there
before
I
tried
to
run
so
it
wouldn't
be
repeatable
or
deterministic
and
you'd,
without
leaving
some
breadcrumbs
to
explain
what
this
happened
and
when
somebody
might
be
clueless
as
to
the
behavior
going
on.
D
Right
so
what
the
validating
web
hook
does
is
you
can
set
it
into
warn
mode
or
fail
mode,
and
then
it
will
in
either
case
even
in
warn
mode.
It
will
print
out
to
the
console
of
the
user.
What
the
error
that
you're
throwing
in
the
policy
is,
so
this
policy
failed
pod
created
if
it
was
in
worn
if
it
wasn't
and
it
was
in
failed
mode,
it
would
just
throw
that
error
and
wouldn't
create
the
object.
So
that's
the
benefit
of
doing
it
through
a
in
mission
controller.
D
Doing
it
through
mutating
can
be
very
beneficial
in
terms
of
the
user
experience,
but
it's
less
visible
to
the
developer,
what's
actually
happening.
So
it's
kind
of
a
way
of
don't
make
your
developer
do
things
because
you
just
mutate
it
with
sane
defaults.
D
On
the
other
hand,
you
know
it
is
kind
of
behind
the
scenes
and
they
don't
know
the
magic
that's
happening,
but
that's
actually
one
of
the
someone
who
used
to
come
a
lot
to
these
meetings.
Chip
zoller
from
dell
he's
actually
one
of
the
maintainers
of
kaiverno
and
has
been
doing
amazing
work
on
all
of
the
policies
out
there
there's,
I
think,
over
a
hundred
policies.
Example
policies
on
their
website
for
everything
and
like
really
good
use
cases
as
well
and
they're
active.
A
You
know
I'm
just
brainstorming
here.
This
isn't
really
an
admission
controller,
but
my
thoughts
are
that
looking
ahead,
people
might
have
a
tendency
to
exaggerate
their
resource
demands
and
as
a
policy,
you
might
want
to
police
that
and
it's
a
sort
of
a
situation
where
you
have
to
let
them
run,
but
leave
a
reminder
that
hey
in
five
minutes
or
an
hour,
I'm
gonna
check
on
this
and
just
compare
kind
of
what
they
declared.
They
were
going
to
use
to
what
the
historical
actual
usage
is
to
go
catch
liars
and
you
might.
A
This
might
be
a
great
basis
for
something
that
I
think
would
be
really
popular.
You
know
the
whole
green
movement
is
popular
of
eliminating
waste
and
if
you
were
to
couch
this
as
something
that
would
support
a
greener
planet
by
re,
reducing
wasted
resource
like
over
provisioning,
compute
and
power
usage
to
more
match,
you
know
I
don't
know
a
target
80
loading
of
what
you
spent
money
on
this
could
catch
on,
and
I'm
wondering
what
tools
might
be
available
to
build.
Something
like
that
out
by
putting
together
existing
projects.
B
You
could
do
that
with
prometheus.
I
would
imagine
so
not
the
modification
part,
but
you
could
pull
back
the
metrics
for
a
given
pod
say
over.
I
don't
know
a
week
and
then
just
say
we'll
take
the
90,
90th,
percentile
or
95th
percentile
and
we'll
just
modify
all
the
objects
to
be
95th
percentile
for
their
resource
quotas.
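In PromQL terms this would be something along the lines of `quantile_over_time(0.95, container_memory_working_set_bytes[7d])`; the percentile arithmetic itself is simple. A small sketch, with made-up sample data standing in for a week of scrapes:

```python
import math

def p95(samples):
    """95th percentile of usage samples (nearest-rank method)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]

# Hypothetical CPU usage samples for one pod, in millicores; a real week of
# 15-second scrapes would be roughly 40,000 points.
cpu_millicores = [100, 120, 110, 90, 400, 115, 105, 95, 130, 125]
print(f"suggested CPU request: {p95(cpu_millicores)}m")
```

Note that with a short series like this, a single burst (the 400 here) lands at the 95th percentile, which is arguably the point: a request sized this way covers almost all observed behavior.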
B

A

B
D
The one issue is, it's the same issue that we have with tools that try to monitor the traffic of an application and then generate network policies off of it, things like what vRNI, vRealize Network Insight, can do on VMs. It doesn't work well in Kubernetes, from my experience, because the average lifetime of a container in Kubernetes is so low, and the tag changes because they made a small change to the code. Anything like this really needs a learning period; it needs to do inference, and it can't do that.
D
It can't infer anything in the amount of time, because by the time it's collected enough data, it's irrelevant: they're already four versions ahead in the container in terms of production. So it works well, possibly, at the beginning, or for people that are just starting with Kubernetes and that may still be on a release cycle of two, three times a year, but not for the people that are really moving into the cloud native world, in organizations that are doing that and doing rapid iterations on their deployments.
B
D
...just did this, and it lowered their memory usage by, like, 20%, just by updating the Go version, right, because there was actually something wrong in the Go version they were using. But any of these things are so specific that you could have a version that also jumps up, or goes down all of a sudden, because they brought in some new library.
D
It
just
becomes
very
difficult.
It's
not
that
it's
impossible.
It's
just
the
numbers
you
get
here
are
not
nearly
as
accurate
as
things
like.
Vr
ops
is
or
vr,
and
I
are
for,
like
compute
and
networking
in
the
traditional
monolithic
world.
It's
just
unfortunate,
it's
unfortunate,
but
it's
a
fact
of
how
you
know.
We've
kind
of
moved
along
in
this
world
where
things
are
just
so
short-lived.
A
D
It would be great if you could catch even these things; you'd need a really good AI model, yeah. If you had a good AI model that Miles built, that would basically correlate things: if it understood the versions of a deployment, and then understood what the standard deviation is between each of those versions, going up and down between versions of the container being used, it's possible that it could predict what the general number of versions you create is, and what the standard deviation changes are between each of those versions, and then build accordingly.
B
Problem
like
say,
you
run
the
python
app
and
someone
just
does
traditional
python
library
management,
import
tensorflow.
All
of
it
just
give
me
everything,
and
you
know
suddenly
it's
five
six
hundred
meg
heavier
in
ram
usage
for
that
one
little
container
I
think
it'll
catch
it
would
catch
outliers,
which
might
be
useful.
You
know
if
there's
anomalies,
but
it's
definitely.
A
Yeah,
I
think
you
know
it's
not
just
things
like
your
example
of
drawing
in
too
much
through
the
python
tensorflow,
but
man
if
you
could
flag
things
like
container
images
that
came
off
docker
hub
that
got
corrupted.
You
know,
there's
been
a
sad
history
of
bitcoin
miners
managed
managing
to
slip
things
into
databases
and
things,
and
if
you
have
this
kind
of
monitoring
going
on,
you
should
be
able
to
catch
those
anomalies
where
yeah
the
version
went
up
but
gee.
A
It's
sure
peculiar
that
when
the
version
bump
happened,
it's
burning
twice
as
much
resource
for
for
some
reason,
and
maybe
you
don't
shut
it
down.
But
if
you
had
a
system
to
report
those
things
as
warnings
or
notifications
that
might
be
really
useful,
not
just
for
greenness
and
preventing
resource
waste,
but
maybe
catching
things
like
security
risks
that
are,
I
don't
know.
Ransomware
would
be
another
thing.
Maybe
you
could
flag
by
this.
D
I
think
that
the
one
interesting
thing
also
that
kiverno
does
is
that
it
has
they
just
added.
I
don't
think
one
six
has
come
out
yet
I
think
it's
still
in
the
release
candidate
phase,
but
when
it
comes
out,
they
added
image.
D
...signing validating webhooks, to make sure that things are signed with, like, a cosign certificate, to make sure that no one intervened in the middle. And that's something that's big with, like, Docker Hub: if you're using public registries, sign your images and make sure that they are actually the same when they come down, because that's usually where a lot of these people are getting in. They're not necessarily getting into the image; they're faking the image on the way and doing things through that.
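For reference, the feature being described landed as `verifyImages` rules. The sketch below follows the syntax Kyverno documented around the 1.6 timeframe; the registry pattern and public key are obviously placeholders:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: enforce
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - image: "ghcr.io/example/*"   # placeholder registry pattern
          key: |-                      # placeholder cosign public key
            -----BEGIN PUBLIC KEY-----
            ...
            -----END PUBLIC KEY-----
```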
D
A
Okay, well, I think we kicked around a whole bunch of ideas here related to this that maybe could spawn off future activity. If what you were talking about, doing this whole storage thing, is something you need an extra person on, I just think it would be a learning experience for me.
A
So
if
you
need
an
extra
body
in
there
or
even
want
to
run
by
a
draft
of
a
presentation
or
something,
let
me
know,
and
also
in
terms
of
a
forum
to
present
it,
it
could
be
at
these
user
group
meetings
or
something
else.
I
don't
know
what
you
had
in
mind
there,
or
even
you
know
that
this
is
almost
such
a
broad
concept,
that
it
could
be
a
whole
series
of
presentations
blogs,
whatever
I
think
and
could
go
in
user
group
meetings,
com,
physical
conferences,
online
conferences
or
whatever.
B
Yeah,
that's
why
we
were
looking
at
the
what
we
are
doing
tanzu
tuesday
march
sometime.
Isn't
it
robert
march
22nd
or
something
yeah.
C

B
C
But I think there's a huge need for this kind of guidance to be out there, because everyone's struggling with this.
A
C
Well, I have a question that maybe you guys can help me with; it's related to the CSI.
C
So I'm still quite unfamiliar with how you do things like metric collection in Kubernetes, and how you can scrape things with Prometheus. And one thing I noticed recently was that the CSI apparently now exposes a load balancer service. So you, with a Prometheus... what's the word for it... instance.
C
Yeah
well
a
thing
that
produces
prometheus
an
no.
I
think
it's
that
yeah
and-
and
I
thought
well
that's
interesting,
and
then
I
thought
I
thought
just
today.
I
think
I
saw
like
this
was
popping
up
in
other
places
as
well
now,
up
until
now,
the
various
kubernetes
distributions
that
I've
seen,
which
are
obviously
vmware
ones.
They
don't
do
this
yet
and
but
now
tkgs
does
because
it
incorporates
version
of
the
csi.
C
Does
this
now
the
problem
is
it's
not
documented
anywhere
at
all,
except
in
the
csi
project,
and
it's
kind
of
weird,
because
you
know:
do
kubernetes
and
there's
this
low
balance
entry
and
it
doesn't
even
work
with
default.
So
I'm
like
what
is
it?
What
can
you
do
with
this?
So
that
that's
kind
of
my
question
like
what's
that?
What's
it
actually
for
how
are
you
supposed
to
use
endpoints
like
that?
C
B
So it doesn't need to be a server-side load balancer, but it is service type LoadBalancer in TKGS, because the thought is that people want to run external Prometheus instances somewhere else on the network, and it would not be in that cluster. Because where you're seeing service type LoadBalancer is in Supervisor or in TKGS, and you would generally not have your Prometheus instance installed there. So it can't be ClusterIP, and there's no point making it NodePort, so it's service type LoadBalancer.
B
What
it
is
is
essentially
key
value
pairs
right.
That's
that's!
How
prometheus
does
its
metrics
collection
it?
It's
a
fetch
based
system.
It
runs
every
15
seconds
by
default
at
prometheus
instance.
That
is,
and
it
will
search
whatever
endpoints
it
is
given.
B
Now
you
can
give
it
endpoints
in
a
number
of
ways
by
a
crd
with
service
with
a
type
of
service
monitor,
and
you
give
it
essentially
the
service
name,
the
name
space
it's
in,
and
it
will
go
scrape
that
it'll,
discover
and
scrape
that
endpoint
right
and
the
what
is
exposed
from
csi
is
just
you
know,
like
g
c
underscore
memory
underscore
alloc
equals
and
then
the
number
right,
that's
that's
kind
of
the
level
of
stuff
that
you're
getting
there
might
be
number
of
pvcs
and
the
amount
of
storage
used.
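A ServiceMonitor is a small object. This sketch shows the shape; the namespace, label selector, and port name are placeholders, since the actual labels on the CSI metrics service would need to be checked against the deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vsphere-csi-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - vmware-system-csi          # placeholder: wherever the CSI runs
  selector:
    matchLabels:
      app: vsphere-csi-controller  # placeholder: the metrics service's labels
  endpoints:
    - port: metrics                # placeholder: the service's named port
      interval: 15s
      path: /metrics
```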
B
I don't know, I haven't looked at what metrics are in there, but it's just plain text, and then it gets scraped every 15 seconds by Prometheus. So you would have to set up a ServiceMonitor, or else a static target, and this is what Prometheus calls them: targets. So you're setting up a target to scrape, and depending on how you install Prometheus, you could do that through the UI, you can do it through K8s itself, because there are API objects to do that, or use their operator.
B
In
any
case,
you
set
up
a
target
that
says
I
want
you
to
scrape
this
ip
at
this
port
at
this
path
and
the
default
passes
path
is
slash
metrics,
so
if
you
go
to
slash,
metrics
you'll
see
all
of
it
and
that
will
just
pull
those
back
and
they
will
then
be
stored
in
prometheus,
and
you
can
run
queries
on
them
over
time.
B
That's
essentially
it
it's
it's
it's
quite
simple,
but
getting
into
the
prometheus
stuff
and
figuring
out
how
to
get
like
new
targets
into
it
is
a
bit
mind-bending
to
begin
with,
especially
if
you've
never
played
with
it.
Before,
like
I
did
the
ml
demo.
I
think
here
at
one
point
where
it
was
like
recording
the
number
of
frames
per
second
that
were
being
processed
by
the
gpus,
and
it
is
it's
very
non-trivial
to
get
that
kind
of
stuff
set
up
if
you've
never
worked
with
it
before.
B
D
And
the
only
other
thing
that
is
very
important
is
that,
if
you're
using
the
tanzu
prometheus
just
because
whether
it's
tensor
community
edition
or
a
commercial
have
fun
it's
in
a
config
map,
so
you
have
to
edit
a
manual
config
map
and
add
it
in
the
prometheus
method.
D
You
can't
use
a
service
monitor,
that's
part
of
prometheus
operator,
which
is
the
the
fact
a
way
that
most
people
are
installing
it
today
in
the
community,
because
it
gives
you
the
ability
to
set
up
alert
rules
for
alert
manager
and
service
monitors
and
all
and
scrape
jobs,
and
things
like
that
all
through
crds.
So
I
would
suggest
going
down
that
approach,
no
matter
what,
because
it's
very
hard
to
manage
through
a
config
map.
Even
if
you
know
prometheus,
it
is
not
trivial
and
have
fun
reloading
the
pod.
Every
time.
B
All
right,
you're
gonna,
have
to
have
a
config
map
based
reloader
for
the
pods
and
all
kinds
of
stuff.
So
just
just
use
the
prometheus
operator
if
you're
interested
and
if
you're
gonna
stand
it
up,
I
would
suggest
the
easiest
way
to
do
that
is
use.
The
helm
chart
called
cube,
prometheus
stack.
That
is
absolutely
one
to
get
started
with.
D
The Bitnami Prometheus Operator, okay. Because I actually just had that with a customer who needed to keep their metrics for a year and a half. Why? Regulations. I said: it's not like it's logging. They said: yeah, metrics need to be kept for a year and a half. Never understood why.
B
And
robert
yeah,
that
is,
the
right
helm
chart
and
I
just
sent
you
my
values
file
for
that
config
map
or
that
helm
chart
that
I
use
live.
So
it
is
run
on
my
arm
cluster.
So
there's
some
image
customizations
just
whip
all
those
out
but
use
the
rest
as
a
reference.
B
A
So we're at noon Pacific time, so last chance if somebody has a very short topic. Otherwise, let's close this one and resume in a month.
A
Okay,
well
bye
everybody.
It
was
a
great
conversation
as
usual
and
we'll
look
forward
to
the
next
one.