From YouTube: TGI Kubernetes 189: Quotas and Budgets
Description
Join Evan Anderson to learn more about managing Kubernetes resource usage in clusters with tools like KubeCost, LimitRanger and Quotas. These tools can help bring visibility into resource usage by different teams in a cluster, and enable better resource usage and sharing of cluster resources.
Hello everyone, welcome to TGIK, at one o'clock now. Sorry to anyone who got an early notification that I'd gone live; I was just getting OBS set up and I forgot that when you say "start streaming," it kicks off YouTube immediately.

Yeah, we've had a few weeks off. We've been talking about making TGIK more of a monthly occurrence rather than an every-week occurrence, because there's a lot of stuff going on in the world, and as exciting as Kubernetes is, people have a lot of other things going on in life. Showing up every week, finding a new topic, and doing all of the production and post-production and so forth was starting to get to be a little bit of a load.

We will be talking about KubeCon EU shortly, among other things, and I'm hoping to see some of you there. If you didn't see the various communications from the CNCF, masks are still required, so we're all going to see each other safely. This is going to be my first international flight in quite a while, so I'm a little nervous, but I'm looking forward to it.

For those of you who've forgotten how this works: usually we start with a little review of what's going on in the general Kubernetes ecosystem, and this week we've already talked about KubeCon EU.

If anyone has particular days, you know, "here's this day-zero thing that's really awesome," feel free to chime in there. Virtual tickets are also still available; I forgot to check whether physical tickets are still available or if that's sold out.

In case you're thinking of going, I'm looking forward to seeing the folks who are there. Also, the Kubernetes release team: for those of you who haven't been keeping track, releasing Kubernetes and getting a release out on time, with thousands of contributors and probably about a dozen or so SIGs, each of whom have features and KEPs and so forth going in, is a pretty complicated process.

But if you're on one of those releases, you're not doing it alone. There's a team that shepherds the release from beginning to end, keeping track of the timelines, the KEPs, and which features are going in and which aren't. The way this works is that there's an experienced team there, but they're always looking for new people to jump in, and you start with a shadow rotation.

So you follow along, but you're not the person who's critically responsible for getting a certain thing done, and you get to learn how all this stuff works and learn the ropes. Then, after that's done, you're part of the release team, you get to work on a release, and you can say, "hey look, I was one of the folks who shipped 1.25 or 1.26," or wherever it is. They are currently looking for volunteers.

I don't have a link for that one either; if someone wants to put a link in, go for it, otherwise I'll slip one in afterwards. The big news that I saw in the larger Kubernetes ecosystem is that Istio has applied to join the CNCF. For those of you who weren't aware, Istio is a project that was initiated by IBM and Google to build a service mesh using Envoy as a substrate, and it was announced several years ago.

Google announced that they would be setting up a new foundation to hold those trademarks, for reasons, but they have decided that those reasons didn't make as much sense as they thought. Instead, Istio is going to join the CNCF and be a regular CNCF project. Hopefully that will make it a little bit easier for groups that liked the Istio idea but felt nervous that it wasn't in a neutral foundation, who didn't feel like they could contribute or who worried their needs might get ignored in the future, to now feel confident that it's going to work for them. I'm sure we'll be seeing a whole bunch more announcements in the coming weeks as KubeCon comes up; I'm guessing there are a lot of folks working hard to get stuff ready.

With that, let's dive into Kubecost and the general problem that these different tools I promised we're going to talk about are here to solve. If you were following my Twitter earlier, we're going to be doing this on a cluster that's been running for a while, because I forgot I had it running. It's been sitting around, paid for by my employer, and it had a demo or two running in it. But the demo wasn't needed anymore, so it was just sitting there eating money.

Now we're going to look at what's on there, we're going to load some extra stuff in, and then based on that we'll see a little bit about how these costing and budgeting and quota tools work. I've used Kubernetes quotas before: a couple of years ago I set up a lab environment for a KubeCon event. We had about 200 or 300 cores by the end (we under-provisioned at the start, which was a problem), and we had about 100 people learning to use Knative and building stuff with Tekton and so forth on one cluster. We gave everyone a quota of three or four CPUs, but when they weren't using it, Knative would spin things down, so we could slightly oversubscribe the cluster, and we used quotas and LimitRanger to manage that.

Kubernetes has this mechanism called resource quotas, and the idea is basically that it lets you say, "hey, this namespace isn't allowed to use more than a certain amount of resources." This acts as an admission controller, so it only takes effect when you go to create a resource. When you go to create a pod, it keeps track of how many resources the namespace is already using, adds that pod on, and sees whether it's above the limit.

The basic resources you can set quotas for are CPU and memory, for both requests and limits. You can say that if you add all the pods' limits together you get 12 CPUs, and you can't have limits higher than that, which means that namespace can't ever use more than 12 CPUs, even if there are empty resources available in the cluster, because that's how limits work: they're that hard upper limit. There are also requests, which are used for scheduling.

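To make that concrete, a compute ResourceQuota along the lines of what's being described might look like the sketch below. The namespace name matches the "small" namespace used later in the demo, but the numbers are purely illustrative:

```sh
# Minimal sketch of a compute ResourceQuota; values are illustrative, not the demo's.
cat <<'EOF' > quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: small
spec:
  hard:
    requests.cpu: "8"       # sum of CPU requests across all pods in the namespace
    requests.memory: 16Gi   # sum of memory requests
    limits.cpu: "12"        # sum of CPU limits
    limits.memory: 24Gi     # sum of memory limits
EOF
```
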
So you can say, "hey, you can't request more than 8 CPU," but if there's no limit enforced, you can burst above that. If everyone in the cluster is trying to burst above that, they'll probably end up getting proportionally about what they expect, but you can run into some surprises: no one else was in the cluster, I thought I was fine, but I was way above my requests, and then someone else showed up and all of a sudden I was using more than I'd asked for without knowing it. When my available CPU drops, I'll still be getting what I requested, but I actually requested less than I needed.

It looks like huge pages are another thing; I know some apps benefit a lot from huge pages. For those of you who don't know what those are: on Linux on x86, your normal page size is 4K, and you say, "four kilobytes, but my machine has 64 gigabytes of RAM." The way that Intel and AMD and lots of other systems handle this is that you have a table that says, "okay, here are the big chunks of memory," and then inside those big chunks you break them down into smaller chunks, and those down into 4K chunks. Huge pages let you use those big chunks contiguously, so you don't have to go through two extra translation layers, which can really speed up some applications; I think the JVM likes that quite a lot.

You may end up needing to tweak that if you're focusing on performance; I've never needed to tweak it in my own Kubernetes usage, but it's there if you need it. You can also set quotas on extended resources, like "this namespace can't request more than four GPUs." Now, the funny thing about GPU resources is that if you have mixed types of GPUs, it doesn't care; it just says, "hey, it's a GPU." I don't care if it's an itty-bitty GPU or a great big GPU, you're only allowed four, so you have to use node affinities or the like to manage the rest.

Oh, here's something I had not realized as well: you can also set limits on storage, and you can do it by storage class. So you can say, "we have some highly replicated, or maybe some SSD, storage and you're only allowed so much of it." I actually don't know how the I/O sharing would work there; you might be I/O limited but allowed huge amounts of capacity.

Oh, and you can also have quotas for how many objects there are, so you can't just go and create, say, 5,000 ConfigMaps and use that for storage, which would be expensive on the cluster, because each one of those ConfigMaps is a separate etcd object.

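A sketch of what object-count and per-storage-class quotas look like, roughly matching what's being read off the docs here (the storage class name "ssd" and all of the numbers are made up for the example):

```sh
# Illustrative object-count and storage-class quota; values are examples only.
cat <<'EOF' > object-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-quota
  namespace: small
spec:
  hard:
    count/configmaps: "50"          # cap the number of ConfigMap objects (each one is an etcd entry)
    persistentvolumeclaims: "10"    # total PVCs allowed in the namespace
    ssd.storageclass.storage.k8s.io/requests.storage: 100Gi   # storage requested from the "ssd" class
EOF
```
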
This is one of those things in the documentation that I always find interesting: there's a whole bunch of conceptual information about the stuff it can do, and then there's a separate "I want to use this, let me get started real quick" page, and sometimes those are far apart. In this case there happened to be a link between them.

So we're going to apply this resource quota to the small namespace.

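For reference, applying and checking a quota like the one above would go roughly like this, assuming the manifest was saved as quota.yaml and the namespace is called "small" as in the demo:

```sh
kubectl create namespace small            # if it doesn't already exist
kubectl apply -n small -f quota.yaml
kubectl describe resourcequota -n small   # shows the hard limits and current usage
```
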
This is actually an interesting case where you want to be a little careful with your RBAC, because the ResourceQuota is a resource that lives inside the namespace, and if you give someone something like edit on all resources in the namespace, they may have permission to update their own resource quotas, which is probably not what you want.

You can see here that there's a set of API groups, but we only have get, list, and watch on resource quotas for the edit role right here, so that's relatively safe, and that's the only place where resource quotas show up. So it looks like, by default, the edit role doesn't actually have edit on the resource quotas in that namespace.

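One way to double-check that on your own cluster (the output formatting varies a bit by Kubernetes version):

```sh
kubectl get clusterrole edit -o yaml | grep -B4 -A8 resourcequotas
```
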
And so we can see that we've asked for five, but... let's see. If we look at the conditions on this deployment, we can see that it has a ReplicaFailure condition. That's hard to read, but it's a ReplicaFailure error because creating a second pod would exceed the requests available under the quota.

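If you want to read those conditions off a deployment yourself, a describe is the easiest way (the deployment name here is hypothetical):

```sh
kubectl describe deployment my-app -n small   # look in the Conditions section for ReplicaFailure
```
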
Here, actually, is an example of the Kubernetes model working for accounting. In the spec you're saying, "these are the hard limits that I want." I think there may also be soft limits, where you can warn people if they're above a certain threshold. But you can see that in the status it repeats back the quota that it's currently enforcing and also reports the usage that it has calculated, because that can be hard to figure out yourself.

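That split is visible if you pull the object back out: spec.hard is what you asked for, and status.hard / status.used are what's being enforced and what's currently measured:

```sh
kubectl get resourcequota -n small -o yaml
```
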
There's just the one pod; we may need to poke the deployment, because it's given up. It said, "oh, I can't create pods," and it may come back after a while, but a quick thing you can do to get it to realize that faster is to patch it.

It looks like maybe there are only hard quotas. Someone was asking if there were soft quotas too, and it looks like there are no soft quotas at the moment. If that's a useful feature for you, being able to warn people before they hit their limits, a KEP would probably be the right place to propose it. If I was building this outside of core Kubernetes, I would use a validating admission webhook, which is probably what this is using, and those do now have the ability to return warnings as well as errors. So you can say, "hey, I accepted this, but you should know: you asked me to install a resource quota and it's less than your current usage."

It looks like maybe Dims proposed the soft quota and then something changed. Let's see, that's still in the small namespace, so we're going to kubectl patch.

I had the name wrong; that's what happens when I type it out. But by changing the annotation, I've changed the data in Kubernetes, so the controller, which has been watching that object, sees "hey, there's an object update" and gets to it faster. Otherwise, when it's retrying something like this, it will just back off, and maybe it'll visit again in 10 minutes, maybe it'll be two hours. If you're managing everything in a way where it just eventually needs to end up in the right place, that's fine. If you fix something and you're hoping to see a result right away, sometimes you need to go and patch the resource with some sort of no-op operation to make it clear that it needs to go and reconcile things.

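A sketch of that kind of nudge: bumping a metadata annotation is effectively a no-op for the workload, but it produces an update event that makes the deployment controller re-evaluate right away instead of waiting out its backoff (the deployment name and annotation key here are made up):

```sh
kubectl annotate deployment my-app -n small tgik/nudge="$(date +%s)" --overwrite
```
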
The question that was being asked next was: how do I find the right limits for the resource quota? This is where the Goldilocks tool comes in, which I just heard about recently although it's apparently been around for a couple of years. It uses the Kubernetes vertical pod autoscaler in suggestion mode. For those of you who aren't familiar, Kubernetes shipped with something called the horizontal pod autoscaler, and the way to think about that is to assume your application scales horizontally: every time you add a new pod you get one more unit of capacity, and they're all basically equally powerful. So you look at how much of your resources you're using and you just stretch, and you get 8 pods or 12 pods or something like that, and then when you're using less you squish it back down. But some applications don't scale well that way; adding another container adds some contention overhead, or maybe there's a unique resource that you need.

The vertical pod autoscaler is a different tool that looks at how many resources a pod is actually using and suggests, "hey, here's a better pod size for you," so you get as small a gap as possible between your requests (and potentially your limits) and what you actually use. It tracks this over time and tries to figure out what your peaks look like and make sure you stay within that peak.

Goldilocks is a tool that creates a VerticalPodAutoscaler for every single deployment you've got and sets it to recommendation mode. It doesn't actually change your pods at all, but it tells you, "hey, the right size for this is between x and y," and then you can manually go in and update it.

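That recommendation-only behavior corresponds to a VPA object with updates turned off; roughly what Goldilocks creates per deployment looks like this sketch (the target name is hypothetical):

```sh
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: small
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # produce recommendations only; never evict or resize pods
EOF
```
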
Part of the reason it explains this is that the horizontal pod autoscaler and the vertical pod autoscaler don't work well together. If you have one thing trying to figure out how tall your rectangle needs to be in order to get all the work done, and a second thing making your rectangle wider and narrower, your box ends up being funny sizes rather than smoothly growing and shrinking, because neither one knows what the other is doing and they don't coordinate. What this is saying is: use the horizontal pod autoscaler for the immediate, reactive stuff, use the vertical pod autoscaler every once in a while to set new values for how high that rectangle is, and then day to day you just stretch that rectangle.

We need to have the vertical pod autoscaler installed, because it is not installed by default. This is one of those additional things, which is part of what's so exciting about Kubernetes: it's extensible enough that you can add a new autoscaler to it without needing to do work in the Kubernetes core.

Okay, so it looks like we're going to clone this, we're going to see what tools are needed to install the VPA, and then, 130 megabytes of source code later, we can go into the autoscaler directory.

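For reference, the VPA lives in the kubernetes/autoscaler repository, and the install flow being followed here is roughly the one that repo documents (script paths can change between releases):

```sh
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh    # processes the component YAMLs and applies them with kubectl
```
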
Okay, let's see what that script actually did now that we've run it: it calls vpa-process-yamls with a create argument. It prints help if it doesn't have exactly one argument, and if the argument is delete, diff, or print, it adds another component to this list of components.

So it appears that this script, vpa-process-yamls.sh, is the plural one, and there's a script without the "s" that it calls for each component to actually install it, by processing the YAML and piping it to kubectl, unless we say print. So it looks like we should be able to run it with print and see all the YAML. This is all the different stuff that's getting installed.

Oh, it looks like it lets you switch which registry and tag you're using, and then it will substitute that stuff in. So this is a place where they've decided the easiest way to do it is with a sed, rather than using Kustomize or Helm or something like that.

Okay, so now we've got this on the cluster. Great. Now we're going to go back over to what we really were trying to do, which is to install Goldilocks. Okay, we've got some workloads with pods.

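The Goldilocks install itself is a Helm chart from Fairwinds; one way to install it looks roughly like this (chart and namespace names may differ in your setup):

```sh
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace
```
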
It would be nice if the documentation linked to how you get a metrics server if you don't have one. The metrics server is used by Kubernetes for the HPA, as well as the VPA, to keep stats on how many resources the pods are using, so that when you go to horizontally or vertically autoscale, you have some measure of how much is being used that you can threshold against. They built their own here because, yes, you could use Prometheus, yes, you could use Datadog, yes, you could use Google Stackdriver or New Relic; the fact that there are 50 choices meant they felt they needed to create another one, because no matter which existing choice they picked, somebody wouldn't like it.

So you can see that this is what metrics server is intended for: the horizontal and vertical autoscaling in the cluster. Don't use it for anything else, basically.

And they say, "hey, by the way, you don't actually need the updater or the admission webhook; you only need the recommender piece."

The updater depends on the recommender, but this is just going to use the recommender without the updater, and it suggests that Prometheus may give you recommendations that are more accurate.

I'm not sure I quite understand what Choco is asking about here. The VPA and Goldilocks help you determine the right resources for your workload. You should probably be setting resource quotas based on something like a budget or a resource forecast, rather than just the current amount of resource usage in the cluster. You should say, "the cluster is size x, we're willing to use up to size y for this application," and then set the resource quota based on that; it's fine if a namespace is below its resource quota.

Were I running a cluster, I would track the aggregate resource usage to figure out when it was time to resize the cluster, and then use the quotas for individual teams to track them against their projected resource usage. If their projected usage suggests I'm going to need new hardware in a month, and it's a physical hardware cluster, then I'd better start ordering that hardware now. I can also track, "oh hey, here's what everyone said they were going to need altogether."

It looks like there's a dashboard, and they're suggesting that you find the pod that is hosting the dashboard and port-forward to it; but first I'm going to realize that this is a Linux shell and I need a Windows shell.

Let's see, here we go. So now we need to pick an application namespace and label it in order to see recommendations.

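The label is how Goldilocks decides which namespaces to watch, and the dashboard is easiest to reach with a port-forward; roughly like this (the label key is the one Goldilocks documents; the service name and ports may differ):

```sh
kubectl label namespace small goldilocks.fairwinds.com/enabled=true
kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80
# then browse to http://localhost:8080
```
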
This is showing the different Kubernetes resource classes for requests. You can designate different pods to have different CPU guarantees. If you set a pod to have a request and a limit that are equal, you're basically saying, "I guarantee that this gets all the resources it needs, and it's not going to use any extra." You can also say, "hey, this is burstable," and set the limits higher than the requests; if you do that, you'll get your requests, but you may get more than that, and your quality of service may vary depending on who your neighbors are.

It actually gives you a nice little YAML that you can copy, saying "use these settings," and it looks like it would actually suggest increasing the amount of memory we would burst to.

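For reference, those two shapes look like this (a sketch; names and numbers are examples, not Goldilocks' actual suggestions):

```sh
kubectl apply -f - <<'EOF'
# Requests equal to limits for every container puts the pod in the Guaranteed QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests: {cpu: 500m, memory: 256Mi}
      limits:   {cpu: 500m, memory: 256Mi}
---
# Limits above requests make the pod Burstable: it can use spare capacity,
# but what it gets beyond its requests depends on its neighbors.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests: {cpu: 250m, memory: 128Mi}
      limits:   {cpu: "1", memory: 512Mi}
EOF
```
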
It kind of looks like this is a "multiply by five" recommendation here, but the CPU is higher than that, so I'm not quite sure where those details come from. I'm also not sure whether this web interface is slow because the Goldilocks system is under-provisioned or because of something else in the network that's not working well.

This might be something that you want to enable, and teach the people using your clusters about, if you're a cluster administrator. Then, when they come to you saying, "hey, you didn't give me enough," they can point at it and say, "look, I'm really using everything you gave me." Because if people can't see what they're using, don't really know how to set these values, and can't see the effects of them, they'll just say "give me more."

Now we are going to take a look at Kubecost. Kubecost is a different tool with a different attitude towards this. Rather than trying to limit how much of your Kubernetes cluster someone can use, it tells you how much it's costing you to run the cluster and how much of that cost individual namespaces are responsible for. Then you can go back to people and say, "hey, our cluster overall costs two thousand dollars a month to run, you're using half the cluster, we need a thousand dollars of budget transfer." People can say, "oh okay, here you go, here's my budget," or they can say, "gosh, we didn't know we were using that much; how can we reduce our usage?", and Kubecost will give you some answers on that. It can break things down by deployment, service, namespace, label, and so forth, and it looks like it will even work across multiple clusters.

This is kind of funny. This is one of those places where open source and running a business have a little tension. It looks like the core cost model is open source, but if you install Kubecost and run it, it will by default call back to the commercial product and leverage that commercial product for part of the value it's giving you. I don't know how easy it is to unhook the open-source part from the commercial part. Sometimes you'll see, "hey, we have this piece that's open source and runs on your cluster, but we also have this commercial part, and no one else has an equivalent part, open source or otherwise"; so yes, this piece is open source, but using it kind of locks you to the other, commercial piece. In other cases it's not that cut and dried, and there is a real open-source alternative.

It just means that you'd be taking on a lot of the work that Kubecost would be doing for you. Going back to that pricing: is it worth 800 a month to you to run your own measurement for those 200 nodes, or would you rather pay someone who's an expert a hundred dollars and have someone to yell at if it stops working, rather than just being upset?

So let's see how to install it. When you go to install it, they solicit your email; I've already put mine in, and you can see that they give you a token in the install instructions. I'm assuming that if you put in your own email address you'd get a different token. Then there's this port-forward, which we will use to look at the dashboard, and again they're pointing at a deployment rather than a service.

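The install flow is a Helm chart plus that token; roughly what the install page hands you looks like this (chart location and values may change over time; the token value is redacted here):

```sh
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set kubecostToken="<token-from-the-install-page>"

# dashboard, via the deployment as their docs suggest:
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090
```
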
Helm values... okay, I'm not quite sure what that is, but it looks like we need something with persistent volume support. This is one of those places where it's frustrating that Kubernetes has all these plugins but there's no basic default for some things, like storage classes.

Oh, good question, let's see; I just did this helm install. Prometheus server persistent volume... I think the other thing that's happening here is that my screen resolution is higher than YouTube's, so maybe next time I will crank my screen resolution down.

And it looks like, since I'm running Tanzu Community Edition, this installs Pinniped, which is an external system for basically setting up OIDC or Active Directory-type authentication within the cluster. It looks like the concierge piece, which does the OpenID authentication and authorization, has been throttled.

Yeah, I've found Pinniped very useful just as a Kubernetes developer, because I appreciate being able to test with restricted RBAC, and the two ways I know of doing that are either to create a service account and then steal its token and use that as my auth token, which always feels a little dirty, or to set up Pinniped and say, "okay, here's an additional user." I use Auth0, which has a nice free tier: "look, please authorize this email account," and then I log in through Pinniped and have the restricted permissions, while also keeping the static admin auth token for managing the cluster. Then I can check: does stuff work with restricted permissions? Can I deploy software? Can I debug this thing? Can I look at logs and so forth in the way I'd expect? Does port-forward work?

Nice, so you can set labels to say who the owner of something is, and similarly what the team is, what department they're in, and so forth, and these are configurable. So if you already have a schema for tracking these things and the owner label is something else, like owner-id for example, you could just change this to say owner-id and it would pick up on those labels instead.

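Purely as a hypothetical illustration: if your allocation schema keyed on "owner" and "department" labels, you would label your namespaces (or workloads) with those keys and point Kubecost's label configuration at them:

```sh
kubectl label namespace small owner=web-team department=storefront
```
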
You can also get billing data about S3 and other costs that might be associated with the cluster but not in the cluster. I don't have any of that set up, and I don't have RDS instances in this account, for example, so I'm not going to use that, but it looks like for both AWS and GCP you can pull in those additional costs and allocate them to teams.

Oh, this is nice: there's a section here on sharing tenancy costs. Every time you start up a Kubernetes cluster, there is a certain cost to just having a cluster; you've got kube-apiserver and kube-dns and so forth, and it looks like you can choose whether to split those costs out among everyone using the cluster or to hold them back. So if you were an IT department that's really trying to encourage people onto Kubernetes, you could say: we just want to get you onto Kubernetes, we'll ask you to pay for your machine usage, just like you had to pay for machine usage before, but you won't have to pay any extra overhead. And then maybe in a year or two, when 80 percent of your organization is on Kubernetes, you say, "hey, everyone is basically on Kubernetes now, all our tooling is Kubernetes, you should be there."

And you can list which namespaces count as shared. So, for example, if you are running Istio, you might say the Istio namespace costs should be shared across the cluster, since everyone's using those istiod resources; or maybe Goldilocks and Kubecost are part of the overhead of running the cluster, and everyone should be paying a little bit for those.

It looks like you can also charge people for the idle resources in the cluster. If you were trying to say, "all the other departments have to pay for this resource usage, we've got no budget for it," then you could say all that idle space, somebody's got to be paying for it. On the other hand, you may say that extra idle capacity is overhead. And it looks like you can choose your own pricing for CPU and RAM and so forth if you're running on premises, for example, so you can report, "for a month it costs me 30 dollars to run a CPU and 2.50 for each gig of RAM."

We've got all these different things, and right now they're not costing us anything; it just doesn't have enough information yet. The Antrea agent, I guess, has a little bit of cost, but it looks like it's about half a cent worth collected so far. And we can aggregate it by a bunch of different things, including aggregating it by node instead.

And so we can see that each of these nodes is generating a certain amount of cost, and then idle is generating the most cost, and I'm kind of curious what idle is now and whether it shows up for any of these other views.

Let's see, if we look at small, does this have... oh, this looks a little different than kube-system. Maybe it's because there's only one deployment in there, but it shows us our cost and a cost efficiency, which I'm guessing is telling us something similar to Goldilocks: that we could be asking for and using a lot fewer resources.

From that point of view, it looks like this would recommend reducing the CPU and actually increasing the memory requests and limits for these pods. It seems to be happy as-is to think that we're using about nine percent of our RAM efficiently, and the rest not. If we go back here, it still says "we're collecting data, check back in five minutes."

Let's see, I think that's about all we've got today, unless anyone has further questions about this stuff. As far as I can tell, Kubecost is the main system people use for figuring out how to do these chargebacks, or else they've built something of their own internally that they're not sharing.

You can enable and disable Prometheus: you can point it at an existing Prometheus instead, or they will bundle one in. So those are all Prometheus settings; there are the persistent volumes; and oh, you can get it to create an ingress for you, so you can expose it through an ingress, although there's no auth policy on that ingress. That would mean anyone in the world who finds your IP address can start looking at your cost information, and you probably don't want that.

You can also configure some TLS, and you can use network policy to gate things off further, but by default it's a service within the cluster, so you need to use port-forward to get to it.

There are Grafana resources, creating a service account... What they don't have documented here is this Kubecost token; interestingly, they have a static token here which was different than the token that I got. So they have a cost model and a cost analyzer, plus Prometheus, which seem to be the key pieces. And the cost analyzer is a bunch of target zips, which kind of feels like maybe it's built from somewhere else; this doesn't quite feel like normal open source to me.

You can get some of this information, but most of the interesting stuff, the part that gives you the allocation of how many resources were used at a particular time, is actually part of a hosted service and isn't open source at all. Goldilocks is, as far as I can tell, fully open source, but it doesn't give you that aggregated view. It's great for an individual team to do the right thing on their own, but for a business trying to figure out where their cloud costs are coming from, it looks like there isn't something open source right now.

So I don't know exactly when I will next be doing TGIK, but if you're going to KubeCon EU, I will see you there. It was fun to shake off the rust and get back into exploring this stuff, and now I know a little bit more of what I mean when I say, "oh yeah, you could use Kubecost."

See you all in a couple weeks. I think Joe's going to make an announcement about what our upcoming schedule looks like. I don't know whether we're going to do anything at KubeCon EU or not, TGIK-wise; I'm probably not going to have much of my gear there, so it would probably have to be someone local to Europe bringing the stuff to do a broadcast.
