From YouTube: Reducing your Kubernetes Cloud Spend
A
There we go, all right. Welcome to today's CNCF webinar, Reducing Your Kubernetes Cloud Spend. I'm Libby Schultz; I'll be moderating today's webinar. We'd like to welcome our presenters today: Webb Brown, CEO at Kubecost, and Nico Kovacevic, CTO at Kubecost. I hope I didn't just butcher that again. A few housekeeping items before we get started: during the webinar, you are not able to talk as an attendee. There's a Q&A box at the bottom of your screen.
B
Excellent. Well, thank you so much for the introduction, and I just want to say welcome, everyone. Thanks for joining us today. We're going to talk about one of our all-time favorite subjects, one that's near and dear to our hearts: helping you effectively manage or reduce spend when running workloads on Kubernetes.
B
What we're going to do today is, first, present an overall general framework for how to think about different optimizations, or opportunities to reduce spend, and then go into some very practical examples, or war stories, that we've picked up over the past several years while working in this area. So let me first start with just a little bit of background on us. My name is Webb; I'm joined by my esteemed colleague Nico.
B
We are both part of the founding team at Kubecost. We build cost monitoring and cost optimization solutions for teams running applications on the Kubernetes platform. We have more than a thousand teams using our product today, across all major cloud providers as well as on-prem, and we're going to talk about some of the lessons we've learned by working directly with hundreds of them. A lot of what we're going to cover is aimed at cloud environments, but a fair amount of it does actually apply to on-prem environments as well.
B
So why are we here? First and foremost, we very much believe that Kubernetes as a platform presents an amazing opportunity to deliver applications more cost effectively.
B
We
strongly
believe
this
and
feel,
like
we've,
seen
this
in
many
migrations
and
many
production
environments,
but
we
also
believe
that
there
are
certain
things
that
can
nudge
teams
towards
a
risk
of
kind
of
overspending
or
increasing
spin
if
they
don't
focus
on
these
areas,
and
so
there's
kind
of
three
core
reasons
that
why
we
like
believe
this
to
be
true
first,
is
that
when
we
see
teams
move
or
fully
embrace
kubernetes
oftentimes
their
decision-making
process,
how
they're,
actually
deploying
or
updating
applications
is
more
decentralized.
B
This
is
again
a
great
thing,
but
it
oftentimes
leads
to
more
just
moving
parts
to
to
monitor
and
more
dynamic
systems,
and
this
is
again
due
to
you
know
faster
release
cycles,
but
also
things
like
you
know,
auto
scaling,
which
are
programmatically
modifying
releases
as
well,
and
then
part
three.
You
know
now
developers
are
empowered
with
the
ability
to
spin
up.
You
know
all
kind
of
resources
whenever
they
need
them.
You
know
this
can
be.
You
know
hundreds
of
gpus
in
any
region
in
the
world
from
any
major
cloud
provider.
B
Today, again, this is an amazing thing, but it also means that mistakes or oversights can be more expensive when they're not caught. So these are just three things that we want to keep in mind as we think about immediate optimizations, but also about ongoing governance of running workloads in Kubernetes.
B
So now that we have the overall problem framed, we want to present a high-level function, or framework, for thinking about making optimizations, for providing a solution to that challenge. Any time we make an optimization, we're going to be touching at least one of these variables: specifically, we're going to be impacting the amount of time a resource is provisioned, the quantity of that resource, or the price of that particular resource.
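As a minimal sketch, the framework Webb describes here is just a product of three terms. The function below is our illustration of that mental model, not anything from the Kubecost API, and the rate used is a made-up placeholder:

```python
def resource_cost(hours_provisioned, quantity, unit_price_per_hour):
    """Cost = time provisioned x quantity x price per resource-hour.

    Each optimization lever discussed in the talk shrinks exactly one
    of these three factors.
    """
    return hours_provisioned * quantity * unit_price_per_hour

# e.g. 4 vCPUs provisioned for 24 hours at a hypothetical
# $0.035 per vCPU-hour:
print(round(resource_cost(24, 4, 0.035), 2))  # -> 3.36
```

Cutting the bill means reducing hours (autoscaling, turndown), reducing quantity (right-sizing), or reducing the unit price (spot, reservations), as the rest of the talk walks through.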
B
If we look at each one of these: the amount of time something is provisioned is the amount of time your cloud provider is actually billing you for that particular resource. If you think about managing this effectively, something like cluster autoscaling would allow you to shorten that time, or adjust it to just the period where you actually need those resources.
B
Part two of this is the quantity of the resources you're provisioning. Optimizing this at a high level is a right-sizing equation, which we'll talk more about: getting just the right amount of, say, RAM or storage for your particular needs, based on your applications. And then part three is really where the finance component comes into play, specifically around looking at the cost of each CPU, the cost of each GB of RAM, et cetera.
B
That
really
thinks
I
like
about
the
bigger
picture
here,
and
it's
a
thorough
quote,
which
is
the
price
of
anything.
Is
the
amount
of
life
you
exchange,
for
it
definitely
meant
to
be
a
little
tongue-in-cheek
but
kind
of
two
big
points
here.
One
is
that
you
know
we're
not
just
talking
about
the
price
of
cloud
resources.
B
Your
time
is
also
really
expensive.
As
a
you
know,
an
infrastructure
engineer,
so
we
really
want
to
think
about
getting
the
like
biggest
impact
for,
like
you
know
your
time
dedicated
to
it
and
then,
secondly,
is
this.
These
changes
can
also
oftentimes
be
like
hard
to
estimate
in
terms
of
the
amount
of
time
that
it
will
take
for
you
to
optimize
it
that
day
and
then
also
like
manage
it
going
forward.
So
we
want
to
try
to
give
you
a
sense
for
difficulty
estimates
knowing
that
there
will
always
be
contextual.
B
So with that, I want to turn it over to Nico. He's going to touch on a super important part, a kind of precursor to optimizing your infrastructure, specifically around measuring allocation, efficiency, et cetera. Nico, you want to take it away?
C
Sounds good, yeah. Thanks, Webb. I'll share my screen really quick, and then we'll step through basically a concrete version of the framework that Webb just introduced, and familiarize everyone a little bit with some of the metrics that our open source Kubecost project scrapes, computes, and provides, which give teams the ability to do some of this cost optimization.
C
So here we're looking at a dashboard of aggregated metrics from our Kubecost open source project, and you can see that things are aggregated by namespace here over the last day.
C
I'd like to basically just unpack the little heuristic equation that Webb just presented (time, quantity of resources, price per resource) and also touch on this efficiency metric. Efficiency is something we think is one of the most important ways of thinking about this problem.
C
Looking at this dashboard, an example of how you might step through this is noticing that we've got 2.2 percent efficiency here. As you might guess, we would consider that to be pretty low. So we can step through why that is, get to the root of what metrics underlie it and how to think about them, and that will lead us, later in the talk, into how to resolve some of those issues: improve that efficiency number, reduce spend, etc.
C
So we're in the default namespace, looking at memory and CPU cost. If we just drill down into this, we'll see that within this namespace, over the last day, we've got a number of containers running in different pods. The three columns A, B, and C, plus the total cost, correspond directly to the equation that Webb covered. The first is time running: here we've got 24 hours for all of these over the last day, which means they've all been running the whole time.
C
So, predictably, this corresponds to the node these are running on, but it's broken down to specifically how much you're paying for CPU for that workload, and then that yields your total cost. So basically, these are your levers: if you want to be paying less, you've got to reduce your hours running, which may or may not be possible.
C
You've
got
to
reduce
your
the
amount
of
resources,
the
quantity
of
the
resource
which
may
or
may
not
be
possible,
but
commonly
commonly.
This
is
this
is
a
big
one
and
then
price
per
cpu
hour,
which
is
also
a
big
one,
but
again
we'll
step
through
that
later.
For
now,
we
can
actually
drill
down
one
step
further.
C
So
if
we
just
look
at
basically
a
grafana
dashboard
of
the
raw
metrics
for
one
of
these
we're
looking
at
basically
a
test
deployment
pod
that
isn't
really
doing
much.
So
what
we're
seeing
here
is
basically
on
the
left.
You
see
cpu,
cpu,
usage
and
request
and
on
the
right
you
see
memory
usage
in
request.
C
So
this
brings
us
to
back
to
the
idea
of
efficiency
and
when
we
talk
about
allocation
at
kubecost,
we're
talking
about
the
max
of
cpu
usage
and
request,
because
what
we're
trying
to
do
is
with
the
metrics
that
we're
emitting
we're
trying
to
share
with
our
users
what
you're
actually
being
billed
for.
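A rough sketch of that allocation idea as we understood it from the talk, reduced to a single resource; the helper names are ours, not Kubecost's:

```python
def allocation(usage, request):
    # You are billed for whichever is larger: what you requested
    # (the scheduler reserves it on a node) or what you actually used.
    return max(usage, request)

def efficiency(usage, request):
    # Fraction of the billed allocation that was actually used.
    alloc = allocation(usage, request)
    return usage / alloc if alloc > 0 else 1.0

# A pod requesting 1.0 CPU but using ~0.02 CPU is ~2% efficient.
print(round(efficiency(0.02, 1.0), 2))  # -> 0.02
```

Taking the max of usage and request is what makes over-requesting show up as low efficiency: the requested capacity is paid for whether or not it is used.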
C
So
when
we
look
at
these,
we
can
see
basically
from
the
cpu
perspective,
we've
got
a
request,
but
we
have
essentially
zero
usage
and
from
a
memory
perspective,
we've
got
you
know,
30
percent
usage,
but
cpu
is
actually
the
primary
cost
of
this
pod.
If
we
go
back
to
the
top
level,
we
will
see,
14
cents
has
been
sent
spent
on
cpu
and
only
one
cent
on
memory.
C
So,
even
though
our
memory
has
you
know,
30
usage,
it's
such
a
miniscule
amount
of
what
is
being
spent
and
cpu
has
zero,
so
we're
hovering
really
close
to
zero.
We're
we're
basically
a
two
percent
efficiency,
so
this
is
basically
the
way
that
we
think
about
our
core
metrics
and
then
this
provides
us
the
three
levers
to
say:
okay,
what
should
we?
What
actions
can
we
take
now?
Who
can
be
alerted?
C
How
do
we,
how
do
we
basically
start
here
and
end
up
in
a
position
where
we're
spending
less
so
I
think
the
plan
right
now
is
to
pause
briefly
and
take
any
questions
from
this
before
we
move
into
specifics
on
how
to
fix
some
of
these
issues.
So.
B
This is such a critical part to lead into the next part of the discussion. The example Nico gave, I think, is great: looking at the default namespace, 14 out of every 15 cents is spent on CPU, so it actually makes very little sense to look at optimizing your memory or storage or network in that namespace, just because you're able to see the biggest cost drivers.
B
Whenever you're allocating your very costly time, we want to steer toward the areas where you can have the biggest impact. So, you've got a handful of questions here, Nico. The first one is: how did you fix the cost per CPU? You want to take that one?
C
Sure, yeah. Thanks for all the questions, definitely some good ones. Basically, if we go back to this screen, the question is about this column, and the simple answer to the first question (and the second question, I guess) is that we integrate with cloud provider billing APIs and we're aware of which nodes your workloads are running on; thus we know how much each node costs and which node a given workload was running on.
C
So we can come up with the price you were paying at the time, for that workload, per resource. And then I believe we also had a question pop up about on-prem environments: in that case, we just allow users to input how much their hardware is costing them. So you can input custom pricing and sort of override this, or provide whatever makes sense for your on-prem situation.
B
Yeah
and
in
there
nico
maybe
worth
hiding
like
we
actually
have
two
different
pipelines
to
how
we
support
that.
We
have
a
really
simple
pipeline
where
you
can
just
say
you
know:
cost
per
like
core
cost
per
gb
of
memory,
et
cetera.
We
also
have
like
a
more
advanced
pipeline
for
teams
that
have
a
lot
of
heterogeneous
assets
and
want
to
actually
you
know,
go
through
and
have
an
individual
like
asset
id
for
each
vm
disk
et
cetera,
and
you
can
actually
again
kind
of
tag
those
with
each.
B
You've
got
a
couple
more
here,
I
think
you've
hit
on
the
first
four
there.
One
next
question
is
you
have
a
product
or
service
business
model
or
both?
So
like
you
know
everything
we're
showing
is
you
know
showing
metrics
from
our
open
source
project?
We
do
have
a
business
and
enterprise
product
with
a
lot
of
like
extra
functionality
available.
B
B
So
next
question
for
you,
nico
is
if
the
workload
is
distributed
across
multiple
nodes.
Will
you
take
out?
Take
the
average.
C
Right,
so
this
is
a
great
question.
I
think
we
get
questions
like
this.
A
lot
and
part
of
part
of
our
answer
here
is
that
I
think
we're
looking
at
the
problem
from
a
slightly
different
perspective
than
this.
So
in
a
sense
we
are
yes,
but
really
what
we're
doing
is
taking
each
instance
of
each
running
container
separately,
so
we
are
instead
of
trying
to
average
things
and
break
them
down,
we
think
of
it
as
aggregating.
C
So
if
we
look
at
this
example
again,
we've
got
price
per
cpu
hour
and
you'll
notice
that
they
actually
aren't
the
same,
even
though
we're
talking
about
cpu
usage
in
the
same
name
space.
Basically,
what
is
probably
happening
here
is
that
these
are
running
on
different
nodes
and
those
nodes
might
have
different
prices
associated
with
them.
C
So
when
we
look
at
it
from
the
top
level,
we
are
saying
basically
like
we've
aggregated
every
every
running
instance
of
every
container
and
every
pod
in
each
of
these
namespaces
that
have
their
own
individual
situations
and
individual
pricing
perhaps
to
arrive
at
this
price.
So
we
don't
given
that,
like
an
individual
instance
of
a
container
can't
be
running
across
multiple
nodes,
there
isn't
any
average
necessary
if,
if
we're
looking
at
the
problem
from
that
perspective,
so
I
hope
that
I
hope
that
helps.
B
Yeah,
no
that's
great,
and-
and
one
thing
I
would
just
add,
is
like
that's
the
exact
model
we
we
implement,
which
is
truly
building.
You
know,
from
the
container
level
up
teams
do
have
the
ability,
if
they
wanted
to
like
override
that
pricing,
they
can
always
use,
provide
like
custom
pricing.
Sometimes
we
see
situations
where
teams
want
to
create
like
an
internal
economy
which
may
not
like
perfectly
reflect
what
their
cloud
provider
is
billing.
So
we
do
have
that
capability.
It's
actually
really
similar
pipeline
to
what
nico
just
mentioned
for
on-prem
environments.
B
So
you've
got
a
couple
more
questions
here.
All
all
great
questions.
Thank
you.
Everyone
next
question
is:
is
there
a
range
of
savings
based
on
your
experience
with
various
customers
which
services
have
larger
potential
per
savings
on.
A
B
B
As an example? So yeah, I may just say that we're actually going to get into this some more with five really practical examples after this. It is not uncommon for teams that haven't focused in this area to be able to reduce spend by 70 or 80 percent. I would say that's really common for teams that are able to devote real engineering resources to doing this optimization.
B
We can talk more about specific examples, though. The next question is: could you compare Kubecost to FinOps efforts, complementary or somewhat overlapping? So, we are part of the FinOps organization; we're actually a founding vendor with the recent launch. We're huge fans; we think it's doing great things, and we fully support all the openness it's bringing, from a training and certification perspective as well as a general education perspective. So definitely complementary, and we're involved in things going on there.
B
So
in
case
of
aws,
are
you
considering
only
aws
fargate
pricing?
I'm
not
sure.
If
I
fully
so,
we
would
basically
just
be
reflecting
the
cost
of
the
node
where
these
workloads
are
being
run.
We
also
have
and
we
can
share
more
resources
on
it.
But
if
you
look
at
this
notion,
there's
one
column
here
called
external
cost.
In
the
view
that
miko
is
showing,
if
you
had
say
let's
say
you're
not
running
like
you
know,
eks
on
fargate,
you're
running
just
other
workloads
in
fargate
or
you're
running.
B
You
know,
like
rds
instances
you
have,
you
know,
like
s3
storage,
buckets
et
cetera.
We
would
allow
you
to
like
allocate
those
cost
back
to
the
actual
kubernetes
tenant.
So
you
can
get
just
kind
of
a
centralized.
You
know
unified
view
again,
whether
that
be
fargate
or
anything
else
outside
of
kubernetes.
B
Okay,
so
that's
lots
of
great
questions
from
q.
A
I
see.
I
also
have
some
here
in
chat
I'll.
Maybe
we
can
split
these
up
niko.
I
can
take
the
first
one
really
quickly
because
there's
another
one
that
we're
going
to
touch
on
in
a
second
sachin
s.
B
I
can
imagine
how
this
would
help
downsize
in
the
cluster
etc,
but
real
workloads
are
burstly.
So
isn't
this
head
room
for
quality
of
service
super
super
relevant?
This
is
where
not
only
like
quality
of
service
comes
into
play,
but
also
the
like
nature
of
your
workloads
and
specifically
around
like
usage
patterns.
B
This
can
be
true
in
production
environments,
but
is
typically
less
true,
so
you
know
again,
this
should
absolutely
factor
into
your
decision
making
process
when
going
through
right
sizing,
and
this
can
impact
if
you're
doing
dynamic,
right
sizing
using
like
a
cluster,
auto
scaler
or,
if
you're,
doing
more
static,
which
we'll
we'll
talk
a
little
bit
more
about
later
in
the
presentation
next
question
is:
does
kubecos
run
as
a
separate
component
in
the
same
cluster,
or
can
you
run
it
outside
the
cluster
nico?
You
want
to
take
that
one.
C
Sure, yeah. Today it runs in your cluster. We've found, from the teams that we work with, that it's actually been really valuable to be able to run it entirely within your own cluster, because there's a whole slew of issues with egressing data, data privacy, etc. Teams can run this product and get all the metrics without having to worry about privacy concerns.
C
We do have some people who have wanted to run this as a SaaS solution, but for now, yeah, it runs in your cluster, right alongside things.
B
Yeah,
and,
and
so
that
that
presents
a
number
of
like
really
interesting
behaviors,
as
well
as
like
nico,
showed
you
with
grafana.
These
metrics
are
written
directly
to
like
a
local
prometheus
instance,
so
you
can
do
a
bunch
of
cool
things
like
create
custom
dashboards
in
the
cluster.
You
can,
you
know,
set
up
alert
manager
for
custom
alerts
from
these
prometheus
metrics,
all
like
miko
said,
while
owning
and
controlling
your
data
and
not
having
to
egress
any
of
this
right,
but
you
can.
B
Alternatively,
we
have
a
number
of
teams,
do
this
that
take
these
metrics
and
send
them
to
some
external
like
bi
tool
or
some
like
hosted
solution
like
say
a
data
dog
where
they
like
monitor.
You
know
other
infrastructure
metrics
right,
so
a
couple
others
and
then
we'll
go
through
these
really
quickly
and
we'll
have
time
for
questions
at
the
end.
So
so
does
the
does
the
price
take
into
account
savings
plans?
Absolutely
we'll
talk
more
about
this.
B
"So what should be the best solution you recommend?" I'm not sure; if you have any more information you can share there, I'm not sure I fully understand that one, but happy to come back to it. There's also one about running it outside the cluster to use it for multiple clusters.
B
So, tons of great questions. I haven't gone back to the Q&A, but I'll circle back to that toward the end of our presentation. For now I'll jump back and let Nico share some of these very practical examples of implementing optimizations, now that we have this newfound visibility and framework for cost allocation.
C
Cool, yeah. Thanks, Webb. So basically, we're just going to step through five of these, but there are many more.
C
If we had all day, we could keep going, but these are some of the top five anti-patterns for overspending that we see teams routinely either not know how to solve, or not even be aware there's a problem, until they start analyzing some of these metrics and it becomes painfully obvious where the problems are. We'll also, for each of these, give our take, from a general perspective, on how we think it's best to solve them.
C
So
the
first
one
is
orphaned
resources.
We
would
categorize
this
as
fixing
pulling
lever
one
you
could
think
of
in
this
equation,
which
is
time
running,
and
this
is
actually
a
pretty
easy
one.
Once
you
see
the
problem,
which
is
just
that
there
are
often
resources
cloud
resources
in
your
infrastructure
that
aren't
doing
anything,
they
don't
have
an
owner,
they're,
just
sort
of
sitting
there
and
you're
paying
for
them.
So
basically
this
could
be.
You
could
think
of
this
as
ips
persistent
volumes
is
probably
the
biggest
one.
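As a rough illustration of hunting for orphaned persistent volumes: the field names below follow the shape of `kubectl get pv -o json` output, but the helper itself is our sketch, not a Kubecost feature:

```python
def orphaned_volumes(pv_list):
    """Return PV names whose phase suggests nobody is using them.

    'Released' means the claim was deleted but the volume (and its
    backing cloud disk) still exists; 'Available' means it was never
    bound to a claim at all. Both keep billing until deleted.
    """
    return [
        pv["metadata"]["name"]
        for pv in pv_list["items"]
        if pv["status"]["phase"] in ("Released", "Available")
    ]

# Example shaped like `kubectl get pv -o json` output:
sample = {"items": [
    {"metadata": {"name": "pv-active"},   "status": {"phase": "Bound"}},
    {"metadata": {"name": "pv-orphaned"}, "status": {"phase": "Released"}},
]}
print(orphaned_volumes(sample))  # -> ['pv-orphaned']
```

In practice the output of a check like this would feed the alerting-to-owner workflow described below, rather than being deleted automatically.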
C
We've certainly had teams install our product and quickly figure out that they had tens of thousands of dollars of disks just sitting idle without owners. Load balancers are a big one. With load balancers and IPs, it's really easy to think, "oh yeah, I'll just expose this," and then maybe the project gets handed off to a different team, or during teardown you eliminate the deployment but forget to eliminate the load balancers.
C
Over the course of months or years that piles up, and you can basically just find a treasure trove of things you can eliminate and stop spending money on. So we consider this one pretty easy in the grand scheme of things.
C
The
impact
certainly
can
be
high,
it's
probably
not
quite
as
high
as
some
of
the
ones
that
we're
going
to
get
into,
but
we
consider
it
easy
because
by
definition,
these
things
are
just
not
being
used.
So
normally
it's
a
it's
a
straightforward
solution.
C
C
If
this
thing
is
sitting
in
a
name
space
and
it's
not
being
used
and
it's
exceeding
a
certain
amount,
someone
gets
alerted
and
then
they
can
come
come
loop
around
so
yeah,
that's
it's
a
pretty
straightforward
solution,
but
the
solution
really
is
just
delete
the
resource
and
stop
paying
for
it
and
how
you
implement
that
to
some
extent
is
up
to
you,
but
it
generally
revolves
around
having
this
mechanism
of
identifying
an
owner
and
then
communicating
to
that
owner.
This
is
you're
spending
money
on
this.
C
All
right
next
one
is
abandoned
workloads,
so,
as
you
see
on
the
slide,
workloads
that
do
not
provide
real
business
value,
we'll
talk
about
heuristics
for
this.
How
we
think
about
this-
and
you
know
some
of
the
teams
that
use
our
product,
how
they
think
about
it
and
what
we
think
should
be
done
about
it.
But
again,
this
is
sort
of
like
a
category.
One
thing
which
is
time
running
you've
got
a
workload
running
on
your
infrastructure.
That's
chewing
up
resources,
but
so
it's
maybe
it
maybe
is
doing
something.
C
It's
you
know
we're
not
saying
that
usage
is
zero,
but
usage
could
be
through
the
roof,
and
if
it's
not
providing
a
real
solution,
then
it
for
all
intents
and
purposes,
might
as
well
not
be
doing
anything
so
sort
of
like
one
step
more
complex
than
than
orphaned.
C
Resources,
so
what
you're
seeing
here
is
a
dashboard
that
was
built
on
top
of
our
some
of
our
open
source
metrics,
and
this
gets
back
also
to
the
throw
quote
from
earlier,
which
is
like
your
your
time
is
also
something
we're
trying
to
help.
You
optimize
just
a
quick
glance
here
at
this
dashboard
you'll
notice
that
we
have
basically
one
workload
that
is
causing
you
know
whatever.
That
would
be
90
90
to
93
of
this
of
this
overspend.
C
This
107
overspend
is
just
in
one
workload,
so
something
that
we're
really
trying
to
help
teams
with
and
that
I
think
the
teams
using
our
our
products
and
our
metrics
have
been
successful
with,
is
just
finding
that
that
low
hanging
fruit,
that's
a
big
win,
and
for
not
a
lot
of
your
time.
You
know
like
this,
this
last
one
on
the
list
25
cents
a
month.
C
Maybe
maybe
you'd
be
fine,
if
you,
if
you
let
that
one
run
but
100
bucks
a
month,
you'll
want
to
take
care
of
so
and
and
to
talk
briefly
about
the
heuristics
here
we
can
also
field
questions
on
this
later
if
people
are
interested,
but
the
way
that
we
recommend
teams
measure
whether
or
not
something
is
abandoned
is
with
network
traffic.
C
So
if
a
pod
is
chugging
away
chewing
up
resources,
maybe
it
even
has
a
request,
that's
higher
than
the
resources
that
it's
it's
actually
using,
maybe
not,
but
if
it's
not
egressing
any
data
anywhere
else,
we
use
that
as
a
heuristic
for
saying
like
is
this
thing
really
being
used?
You
know
it
might
be
computing
things,
but
if
it's
not
sending
that
result
anywhere,
then
it's
at
least
a
flag
of
like
you
might
want
to
revisit
this
and
again
as
we
move
into
solutions
for
abandoned
workloads.
C
This would be basically a medium difficulty, because it's tougher to know. It's not as easy as "this thing doesn't have an owner at all"; we're talking now more like "this looks a little fishy, but if we contact the owner, they should be able to justify it," and often they don't even realize it's still running. So, as listed here, common examples: deprecated deployments.
C
Maybe this is something where responsibility shifted from one team to another, and the new team wasn't even aware it was still running, and the original team didn't tear it down. Dev environments are a huge one. This is one where we have some other open source projects, related to cluster turndown, that could address it, but essentially your dev environment, on nights and weekends, is sitting there with a request.
C
If
you
don't
have
it
turned
down
and
you're
spending
that,
and
it's
really
not
doing
anything
so
again.
The
general
theme
here
lack
of
awareness,
organizational
changes,
things
like
this,
where
things
fall
through
the
cracks,
but
we
can
see
like
huge
impacts
from
from
abandoned
resources
and
then
again
the
solution
is
basically
to
it's
very
similar
to
orphan
resources,
set
up
some
sort
of
alerting
rule
dashboard.
C
Where
there's
a
point
of
communication
who
is
an
owner
for
you
know,
let's
say
like
a
common
one:
is
people
will
have
owners
by
namespace,
so
you
assign
ownership
by
a
namespace,
and
then
we
can
go
in
and
say:
okay
like
here
are
all
the
abandoned
resources
in
this
namespace.
B
Yeah, I'll take it from here. So those are two of the five. Number three is kind of a catch-all; we've seen a lot of war stories, or unfortunate circumstances, here. We say these are workloads that are behaving in unexpected ways. A common example would be an application bug.
B
We've seen, actually, a pretty recent story of essentially an infinite loop that autoscaled resources and cost tens of thousands of dollars. We also had a user that had a Bitcoin miner installed in their Kubernetes cluster, and that, plus autoscaling, led to a huge burst in resource consumption.
B
So
these
are.
These
are
kind
of
those
like
long
tail
of
unexpected
events
that
when
they
happen,
can
be
even
like
in
a
relatively
short
amount
of
time,
fairly
costly
and
so
there's
the
problem
of
kind
of
addressing
that
particular
event
and
then
there's
the
also
the
problem
of
kind
of
like
having
monitoring
or
governance
in
place
to
where
you
kind
of
minimize.
Those
events
happened
when
these
are
kind
of
you
know
present
often
times
they're
meaningful.
B
I
think
there's
a
little
bit
of
like
selection
bias
here
on
our
part,
but
like
when
teams
present
them
to
us.
They're
oftentimes,
like
you,
know,
a
real
part
of
their
their
spin.
This
definitely
crosses
into
the
like
medium,
if
not
like.
You
know
medium
hard
category
just
because
there
is
a
really
long
tail
of
things
to
monitor.
For
you
know,
part
of
the
solutions
that
we've
seen
are
like
really
just
monitoring
for
kind
of
unexpected
changes
in
spend
or
like
spend
anomalies.
B
A common pattern would be looking at, say, the moving seven-day average for the cost of a namespace, or the cost of a cluster, etc., and then having a mechanism to notify team members and being able to take action quickly.
B
One of the beautiful things about Kubernetes, and a real change here, is that with Kubernetes metrics you can truly have real-time cost monitoring and alerting, and not have to wait until you get a bill from your cloud provider, which may be days, or many hours, later. So Kubernetes metrics, whether that's Prometheus metrics directly integrated with Alertmanager or another solution, can get you this visibility in real time, or near real time.
B
So I'll leave that aside. Number four starts to get into the third input in that equation, which is managing the price of resources, and this touches on usage type. Specifically, when we talk about usage type, we talk about selecting across on-demand versus spot or preemptible, versus making reservations, whether that's committed-use discounts, reserved instances, or savings plans. It's really about going above and beyond just using basic on-demand instance types.
B
This can be hard, just because it really involves an effort of predicting, or forecasting, the future. Oftentimes finance will get involved if you're part of a bigger organization, so managing that across teams can be difficult, as can accurately predicting the future. But it oftentimes can yield really big benefits for teams that do have some predictability in their spend going forward. And this is a high-level visual we want to present, because we think it's a really powerful framework.
B
We
want
to
present
because
we
think
it's
a
really
powerful
framework-
and
this
is
you
know,
looking
at
using
on-demand
versus
reserved-
and
this
is
with
a
cluster,
auto
scaler,
helping
you
kind
of
dynamically
adjust
for
different
kind
of
workload,
demands
or
usage
patterns
in
your
product,
and
what
you
generally
see
here
is
that,
as
you
have
more
and
more
predictability
into
that
kind
of
base
level
load
that
you
know
will
always
be
there.
Whether
this
is
from
a
compute
and
memory
standpoint
or
gpu
standpoint
or
something
else,
you
know:
data,
storage,
etc.
B
You
can
start
layering
on
more
of
these
reservations
and
again
have
major
savings,
and
then
that,
coupled
with
like
auto
scaling
on
demand,
nodes
can
be
a
super
powerful
framework
and
again
you
know,
yield
70,
plus
percent
savings
and
then
a
very
similar
framework
when
looking
at
spot
or
preemptable
usage
would
again
to
be
stacking
on
those
reservations
as
you
as
you
have
like
more
and
more
predictability
and
baseline
load
and
then
letting
spot
availability
scale.
You
know
naturally,
given
those
kind
of
marketplaces.
B
This
is
a
really
big
one
and,
and
you
know,
kind
of
considered
hard
just
because
it
requires
architecting,
your
kubernetes
workloads
in
a
way
that
they
are
resilient
to,
like
you,
know,
node
termination
or
like
regular
node
failure.
So
that's
definitely
one
thing
and
that
could
be
touching
on.
You
know
like
managing
replicas,
and
you
know
pod
disruption,
budgets
et
cetera,
but
for
again
for
teams
where
that
is
a
potential
fit.
These
can
yield
huge
benefits
and
then
our
last
example
is.
B
You
know
one
that
we
had
a
question
about,
which
is
you
know?
It
is
really
easy
to
come
to
kubernetes
and
say:
there's
just
so
much
complexity,
I'm
just
gonna
start
by
over
provisioning
resources,
and
so
that
way
will
minimize
the
risk
of
you
know:
downtime
performance,
you
know,
bottlenecks
etc,
and
this
is
makes
you
know,
total
sense
and-
and
we
actually
recommend
for
teams
that
are
brand
new
to
to
kubernetes,
to
just
follow
that
pattern
right.
B
It's only when you really reach production at scale, where these dollars become meaningful enough, that it makes sense to make these investments and really go through a right-sizing exercise. And this is not an uncommon scenario: when we first start working with teams, they regularly have up to 80 or sometimes even 90 percent idle or slack capacity. Just as we mentioned, this is oftentimes simply making sure they have ample headroom for bursts. But teams that go through the exercise of actually measuring peak usage, or say p99 usage,
B
can oftentimes reduce this in a major way without any sort of autoscaling, just doing it statically.
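Concretely, the static version of this exercise often just means replacing guessed requests with values derived from measured usage. The numbers below are purely hypothetical, sized near a measured p99 plus some headroom:

```yaml
# Illustrative container resources: requests set near measured p99
# usage plus headroom, instead of a defensive over-provisioned guess.
resources:
  requests:
    cpu: 250m      # e.g. if measured p99 CPU was ~200m
    memory: 400Mi  # e.g. if measured p99 memory was ~350Mi
  limits:
    memory: 512Mi  # exceeding the memory limit OOM-kills the container
```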
So that's part one. Part two is really taking into account the quality of service, or SLA, expected from the cluster, which can have a big impact as well. Oftentimes we see teams take a uniform strategy to over-provisioning, whether it's a dev cluster, a staging cluster, a prod environment, or a critical environment, when in reality you can apply that context and oftentimes
B
be much more comfortable running at lower compute and risking CPU throttling in a dev environment, because the impact of that may be relatively low, given your particular circumstances. When taking that into account, we've seen
B
teams have major, major wins just by going through this exercise statically, taking that profile into account, and then maybe programmatically, once a day or even once a week, making that adjustment and doing a bin-packing exercise where they see where their workloads best fit, given the instance types available from their cloud provider. And then the other part of this is whether your environment is a good fit for autoscaling.
B
We do consider it medium, if not hard, just because it can actually create risk if you're not careful: when you do have a bursty workload, you may not have extra headroom to support it. So anytime you're going through this, we recommend not fixating on the median case. The median case is helpful for getting a high-level understanding, but once you start moving into optimizations, really think about something closer to peak
B
usage, and what the impact of a right-sizing exercise would be on peak utilization. We have seen, time and time again, that avoiding these patterns can reduce spend by 80-plus percent when done right, and we think you can do it without creating any performance or reliability concerns. Oftentimes it's just a useful cleanup exercise as well.
B
But again, when you're pursuing these, we want you to try to focus on the biggest bang for your buck, given how valuable your time is, and to start with that allocation piece, so you know where the biggest opportunities for spend reduction are first. We love talking about this stuff; reach out to us anytime at team@kubecost.com.
A
B
Awesome. So the question here is: does VMware or Oracle have a similar cost-optimization tool? I'm actually less familiar with the offerings of either, but I do know that VMware has the CloudHealth product, which does provide cost-optimization solutions.
B
The workload, it sounds like, is about a thousand customers coming at one time; so how many pods and how much cost are required to support that? Nico, do you want to take that one, or do you want me to? Sure.
B
Yeah, let us know if there's more context you can provide. I do think the exercise Nico went through for measuring efficiency and doing pod right-sizing can be super valuable, and I would say that, combined with something like HPA, can be great, if that stateful set, or stateful application, is architected in a way to support it.
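For reference, a minimal HorizontalPodAutoscaler of the kind mentioned here might look like this (target name, replica bounds, and threshold are hypothetical):

```yaml
# Hypothetical HPA: scale the "web" deployment between 2 and 10 replicas,
# targeting 70% average CPU utilization relative to requests.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```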
C
Yeah, I would say just in general, comparing your usage and your requests is a good exercise. Doing that, you will see either that your usage sits well below your requests for long periods of time, or perhaps that it's actually too high. We see that sometimes, in which case you're risking eviction and things like that. So just use those two metrics from the Grafana dashboard shown earlier in the presentation. And then there are other ways, like running statistical analyses for, say, p99 or p85, depending on, you know,
C
is this something that you don't mind getting killed, or is this high availability, where you never want it to be killed? That gives you heuristics for what sort of overhead you want to maintain, but to some extent it's up to you.
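If you want to automate that usage-versus-requests comparison, one possible sketch is a Prometheus recording rule. This assumes cAdvisor and kube-state-metrics are installed and exposing these metrics; the rule name is hypothetical:

```yaml
# Hypothetical Prometheus recording rule: per-pod CPU usage as a
# fraction of requested CPU. Values well below 1 for long stretches
# suggest over-provisioned requests; values near or above 1 suggest
# the requests are set too low.
groups:
  - name: efficiency
    rules:
      - record: pod:cpu_request_utilization:ratio
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
```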
B
Yeah, that's great. So the next question looks like a two-parter. In the case of applications that are running at, say, 60 to 80 percent idle, is it better to use a serverless solution,
B
like a Kubernetes-native Kubeless, and do we recommend that? Another example provided here is Fission. I'm less familiar with Fission, but I think this gets at the broader picture, where cost is one part of the decision. It's similar to asking: should you migrate cloud providers for cost?
B
We think it's a very relevant input to the equation, but oftentimes performance, availability, and functionality are very core parts of it as well. I will say that we do see serverless as a useful tool for managing costs. What we've seen is that we regularly work with teams that have medium- or very high-complexity applications, and most of the time it's hard to move all of their workloads to serverless.
C
All right, it looks like we've got a question about the presentation link. I would check out the chat; I believe there's an answer there about that. Next: what is a generally allowable percentage of total capacity that has been observed empirically? I assume bin packing is not optimal all the time.
C
I think this varies slightly from team to team. In some parts of our application we use a notion of a profile, which is to say it depends on the priority of what it is you are running. We would say probably as high as somewhere between 75 and 90 percent, if you don't mind getting evicted.
C
That's if this is a dev thing, where 30 seconds of downtime here and there is okay but you really want to squish down the cost. For high availability, I would say definitely a more generous overhead; my gut says something like 60 percent utilization with 40 percent overhead. Webb might have a different answer here, but it sort of depends on your situation.
B
Yeah, I think it's surprisingly common to see teams land at 35 to 40 percent overhead, and I think it's a function of just what Nico mentioned: quality of service. But there are, I think, two other things that come into play here. One is variability of resource requirements; in the example Nico showed, things were super stable.
B
So if you have just a bunch of long-running batch jobs, you may be able to get to 90-plus percent utilization, because resource utilization is really stable. The second part is if you also have high predictability of resource utilization looking forward. I have definitely seen scenarios where teams are in the 90s, but I do think it is not the norm at this point. And we've just got two more minutes here.
B
We've got two questions here at the end. One is recommended books or resources. There is a book called Cloud FinOps that I think is really, really good. One of its main authors is one of the creators of the FinOps Foundation,
B
J.R. Storment. It's definitely one I recommend; it paints a holistic picture of managing spend in cloud environments. And then the last question is: is it a good approach to invest in autoscaling with ML techniques, especially in VPA cases? We absolutely think it can yield benefits, but we definitely recommend starting with simpler solutions, just for introspection purposes and for understanding why things are behaving the way they are.
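One simple way to follow that advice with VPA specifically is to run it in recommendation-only mode first, so it surfaces suggested requests without changing anything (the object and deployment names below are hypothetical):

```yaml
# Hypothetical VPA in recommendation-only mode: it records suggested
# CPU/memory requests for the "web" deployment (visible via
# `kubectl describe vpa web-vpa`) but never evicts or mutates pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"
```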
B
We especially recommend that if it's a production or critical environment. But when you're really looking at fine-tuning this down to the last dollar, we have seen scenarios where ML can be very useful in doing that. Oftentimes, though, when we first start working with teams, there are bigger wins from just investing the time up front to do right-sizing
B
and to do all of the exercises that we went through here today.
C
Yeah, I think it's fair to say we could be even stronger on that point and say something like: we would not recommend it if you haven't gone down the path of understanding what's going on. Otherwise you risk creating a second layer of misunderstanding, of a similar nature to what we're trying to help teams solve in the first place.
B
Yeah, definitely. And again, that can impact not just cost but reliability and uptime, as well as general performance, with CPU throttling, for example. All right, excellent. We're out of time, but we want to thank everybody again for all the awesome questions and for joining us today. I really appreciate Libby and the team at CNCF for making it happen.
A
Of course. Thank you both so much for a great presentation, and we look forward to seeing everyone again soon. Check back on the website later today and we will have all of this good stuff loaded and ready for your enjoyment. Talk to y'all later.