From YouTube: Spinnaker Workshop Part Three - Cost Optimization - Ajay Tripathy & Webb Brown, Kubecost
Description
Spinnaker Workshop Part Three - Cost Optimization - Ajay Tripathy & Webb Brown, Kubecost
Join industry experts who will lead talks focusing on several key and core items that will give you an overview of the breadth of the power of the Spinnaker platform. In this interactive workshop, you will learn about:
How you can get real-time cost visibility and insights, helping you continuously reduce your cloud costs.
For more Continuous Delivery Foundation content, check out our blog: https://cd.foundation/blog/
A
Welcome everyone, thanks for joining today. My name is Webb Brown. I'm really excited to talk about how you can use Spinnaker and Kubecost to dynamically and continuously optimize Kubernetes workloads in order to reduce waste and reduce compute and overall cloud spend.
A
So I've got just a handful of slides to walk through, about five or six, and then I'm going to turn it over to Ajay, where we're going to spend the bulk of the time on a demo, actually seeing these two products live. Then, finally, we're going to save time for any questions that you have. So first, a really quick background on ourselves.
A
So, like I mentioned, my name is Webb Brown. I'm co-founder and CEO of Kubecost, and I'm joined by Ajay, who is also a co-founder and the CTO of Kubecost. We previously worked together at Google for a long time, thinking about similar sorts of problems around infrastructure, monitoring, and optimization. A high-level overview of Kubecost: it's an open platform for cost management, specifically built for teams running Kubernetes.
A
We help teams monitor and maximize the efficiency of spend when running a Kubernetes environment, or a set of Kubernetes environments. A little bit more context: the open source project was launched in 2019.
A
Like I mentioned, our founding team had been thinking about these problems for a long time, both with internal infrastructure at Google as well as with Google Cloud and external developer tools. Kubecost today is deployed in more than a thousand different Kubernetes environments, across a lot of different variants: anything from air-gapped and on-prem environments to the big three cloud providers, AWS, GCP, and Azure.
A
And then also a long tail of other environments. Kubecost itself can be deployed in minutes or less; it's based on an Apache-licensed open source cost project. There's a free community version, which we'll be sharing today, and then there are also enterprise offerings built on top of those open projects. Kubecost itself really helps teams in three different areas.
A
The first is around really just gaining visibility into spend: helping teams understand spend in a Kubernetes environment from any view (by team, application, deployment, StatefulSet, et cetera), and then the related spend outside of Kubernetes, resources like S3 or RDS or BigQuery, and actually being able to allocate those costs back to Kubernetes tenants. And then, once teams have this visibility, we really help teams optimize those workloads and related resources.
A
Kubecost itself delivers insights that can be statically or manually applied, and also dynamically applied via tools like Spinnaker and others. Then, finally, Kubecost helps teams govern cloud spend and waste on an ongoing basis. There's a lot of functionality here around budgeting, alerts, recurring reports, chargeback integrations, et cetera, that really helps teams keep a handle on spend and its efficiency over time, oftentimes in larger organizations.
A
So that's a really quick rundown of the three major functionality areas for Kubecost. Now I'm going to talk about some practical applications of this. The way Kubecost works today is that it integrates tightly with, name your own, PromQL time series database, commonly things like Prometheus, Thanos, and Cortex. Kubecost then builds a bunch of ETL caching pipelines on top of that database, so that you can make really fast queries, efficiently, at large scale and over large time windows.

Those fast queries can be made from the open source Kubecost Spinnaker plugin via the Kubecost APIs, from the Kubecost UI itself, which Ajay is going to show, and also via tools like Armory and others. So it's really important for Kubecost to bring the data to you, both the visibility and the insights that you're planning to apply, into the tools that you regularly use, whether that's Grafana or your existing BI or monitoring solutions.
A
So specifically, how do Kubecost and Spinnaker work together? Again, Ajay is going to walk through this demo in great detail, but it really starts with the Kubecost insights API, or savings API.
A
This is used to determine cost efficiency. As Spinnaker goes through its deployment pipeline, it actually uses that efficiency data, your determined thresholds, and even context about your workloads to dynamically make a deployment decision: for example, to adjust the resource requests, the resources provided to a workload by the Kubernetes scheduler. That can be done when you manually run a pipeline or on a recurring basis; the change is then actually configured via the Kubernetes control plane, and this process repeats itself.
A
So that's a really quick rundown of the high-level components that Ajay is going to be talking about. I will now turn it over to him to walk us through how these tools integrate and operate together. So, take it away.
B
This should allow you to get a clear understanding of who's spending what on your infrastructure. Let's say, for example, you're responsible for a namespace called acme-air. You can filter for it here and look at further breakdowns of the cost. You can see, for example, that over the last seven days acme-air has spent 77 cents on CPU, three cents on RAM, 78 cents on persistent volumes, nine cents on network costs, and $1.36 on other shared costs. On this cluster, we've decided to share the kube-system namespace's cost with all other namespaces.
B
You can configure essentially any aggregation to be shared. However, we're only about 7.9% efficient. What that means is that, for pods in acme-air, CPU and RAM requests on average are only about 7.9% cost-weighted utilized. If you add those two numbers together, 77 cents and three cents, you get 80 cents, and only about seven cents of that is being utilized: roughly seven percent.
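As a rough illustration of the arithmetic above, here is a minimal Python sketch of a cost-weighted efficiency calculation. The request costs are the figures from the demo; the per-resource utilization fractions are hypothetical, chosen only so the blended result lands near the quoted ~7.9%:

```python
# Cost-weighted efficiency: utilized spend divided by requested spend.
# Request costs are the demo's figures for the acme-air namespace.
cpu_request_cost = 0.77   # dollars of CPU requested over 7 days
ram_request_cost = 0.03   # dollars of RAM requested over 7 days

# Hypothetical utilization fractions (not from the demo), picked so the
# cost-weighted blend lands near the ~7.9% efficiency quoted on screen.
cpu_utilization = 0.08    # 8% of requested CPU actually used
ram_utilization = 0.05    # 5% of requested RAM actually used

utilized_cost = (cpu_request_cost * cpu_utilization
                 + ram_request_cost * ram_utilization)
requested_cost = cpu_request_cost + ram_request_cost

efficiency = utilized_cost / requested_cost
print(f"cost-weighted efficiency: {efficiency:.1%}")  # → 7.9%
```

Note that the blend is weighted by cost, so the (more expensive) CPU utilization dominates the result.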
B
We could edit this in our kubectl config, but what about when this changes, or for new versions or images, or if we get a sudden burst of traffic? Instead of manually updating this, we can create a Spinnaker pipeline to automatically adjust our memory and CPU requests, and we've done that here. The way we've done that is by creating a custom webhook stage that calls into the Kubecost API for recommendations and then automatically deploys to acme-web.
B
You can see here that we make an API call with a couple of those target CPU and RAM utilizations we discussed earlier, as well as the window over which we want to run our request-sizing algorithm, and the namespace and container name.
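That call can be sketched roughly like this in Python. The endpoint path, in-cluster service address, and parameter names here are assumptions inferred from the description in the demo, not an exact Kubecost API reference:

```python
# Sketch of the webhook stage's request to a Kubecost request-sizing API.
# Endpoint path, service address, and parameter names are assumptions.
import json
import urllib.parse
import urllib.request

# Typical in-cluster service address for the Kubecost cost-analyzer (assumed).
KUBECOST = "http://kubecost-cost-analyzer.kubecost:9090"

params = urllib.parse.urlencode({
    "window": "2d",                  # lookback window for the sizing algorithm
    "targetCPUUtilization": "0.8",   # target CPU request utilization
    "targetRAMUtilization": "0.8",   # target memory request utilization
    "filterNamespaces": "acme-air",  # scope to the demo namespace
})

def fetch_recommendation() -> dict:
    """Fetch a request-sizing recommendation from the (assumed) endpoint."""
    url = f"{KUBECOST}/model/savings/requestSizing?{params}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

In the demo this call is made by a Spinnaker custom webhook stage rather than by hand, but the query parameters carry the same information either way.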
B
After that, we receive a response from our API with a new suggested request. That suggested request gets compared to the existing request, along with a projected efficiency, to make sure we continue to be efficient, and then it gets templated in and deployed.
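The compare-then-deploy step can be sketched as follows. The field names, the 50% efficiency floor, and the suggested values are all hypothetical, chosen only to illustrate the gating logic described above:

```python
# Gate the resize: only adopt the suggested requests when they differ from
# what's currently running AND the projected efficiency stays above a floor.
# Field names and the 0.5 floor are hypothetical illustrations.
MIN_PROJECTED_EFFICIENCY = 0.5

def decide(current: dict, suggestion: dict) -> dict:
    """Return the resource requests to deploy, given the current requests
    and a Kubecost-style suggestion carrying a projected efficiency."""
    unchanged = (suggestion["cpu"] == current["cpu"]
                 and suggestion["memory"] == current["memory"])
    if unchanged or suggestion["projectedEfficiency"] < MIN_PROJECTED_EFFICIENCY:
        return current  # skip the resize, keep what's deployed
    return {"cpu": suggestion["cpu"], "memory": suggestion["memory"]}

# Example: the demo workload running at 100m CPU / 100Mi memory.
current = {"cpu": "100m", "memory": "100Mi"}
suggestion = {"cpu": "25m", "memory": "20Mi", "projectedEfficiency": 0.8}
print(decide(current, suggestion))  # → {'cpu': '25m', 'memory': '20Mi'}
```

In the real pipeline the accepted values are then templated into the Kubernetes manifest and deployed.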
B
Yes, so we can confirm that it's still running at 100 millicores of CPU and 100 MB of memory. Let's quit that and kick off a manual execution of the pipeline.
B
It's been running for 31 seconds.
B
So in this demo we've manually run the Kubecost get-efficiency stage and the whole deployment pipeline. But you can, for example, add a cron trigger to update your requests, or update your requests every time a new image is deployed, or any of the other great things you can do with Spinnaker or your CI/CD pipeline.
A
Questions? All right, thank you, Ajay. It looks like we've got a handful of questions here; I see at least three. Thank you for the questions, and again, thank you, everybody, for joining.
A
So the first question I see is: how do you calculate cost efficiency? Ajay and I spent most of our time talking about the cost efficiency of a particular workload, whether that's a pod, a deployment, a StatefulSet, or something else. This is a cost-weighted measurement of the amount of resources that you have requested, and that the Kubernetes scheduler has therefore provisioned, relative to the amount of resources that you are actually utilizing.
A
So, if you're requesting a lot of CPUs but using only a small fraction of them, you're going to have low cost efficiency; and if you're requesting a relatively small number of CPUs but using all of them, you're going to have high cost efficiency. The goal is not necessarily, depending on your use case, to always try to get to 100% cost efficiency.
A
It's about balancing the trade-offs between cost, reliability, and performance. For that CPU example: if you have really high cost efficiency, you may be at risk of being CPU throttled, and if this is a production application, that may not at all be a good thing; it may not be worth the extra cost savings.
A
Peak utilization, or more generally the distribution of your resource utilization over time, is a really important concept when thinking about how to appropriately set these values, so that, again, you're not at risk of being CPU throttled, or evicted because of out-of-memory errors. So again, context matters, in the sense that different workloads may have different relationships between the goals of cost reduction and performance improvement.
A
I hope that's helpful on cost efficiency; happy to share more there, just let us know. There are a lot of questions, and the next question is: what is idle, and how do you come up with that? Ajay, do you want to take that one? Sure.
B
Yeah, so anonymous asked: what's idle, and how do you come up with that? Let's say you're running just one workload in the cluster, and it's taking up one of your CPUs: you've requested one CPU and one gigabyte of RAM, and there are 10 CPUs and 10 gigabytes of RAM on the cluster.
B
What we're calling idle here are those nine CPUs and nine gigabytes that exist on the cluster but haven't been requested by, or allocated to, your workloads. And let's also say, for example, you've requested one CPU, but at some point the pod bursts up to two CPUs, above the request. Because we take the max of usage and requests in our notion of allocation, we would then say the idle becomes 8 CPUs and 9 gigabytes of RAM.
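Ajay's example can be sketched in a few lines. The formula, idle equals cluster capacity minus the max of request and usage, is taken directly from his description; the function name is just for illustration:

```python
# Idle = cluster capacity minus what's allocated to workloads, where
# allocation is max(request, usage) per resource, as described above.
def idle(capacity: float, request: float, usage: float) -> float:
    return capacity - max(request, usage)

# The example from the demo: 10 CPUs on the cluster, 1 requested,
# with the pod bursting to 2 CPUs above its request.
print(idle(capacity=10, request=1, usage=2))    # → 8 CPUs idle
# And 10 GiB of RAM with 1 GiB requested and less than that used.
print(idle(capacity=10, request=1, usage=0.5))  # → 9.0 GiB idle
```

The burst above the request counts against idle (it reduces it), even though it never shows up in the requested figures.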
B
So you can essentially think of idle as overhead that has not been allocated to any workload running in the cluster. That's slightly different from the overhead captured by the cost efficiency metric, which is the difference between what's actually being utilized and what's being requested by a workload. So that's the difference between the cost efficiency number and the idle number. But yeah, super awesome question.
A
Yeah, and ultimately it's really a measurement of the efficiency of your bin packing, and also of your cluster sizing. There are actually some other insights in the Kubecost platform that Ajay didn't share that get to that at a cluster aggregate level: here we were looking at right-sizing individual workloads or deployments, but we also have those same insights to right-size your cluster in aggregate. And again, the same thing would be true there, where we'd be looking at the shape of resource requirements over time, not just thinking about median utilization but really looking at peak demand, or p99 demand, or whatever's important in your environment.
B
Yep. Jesse asked: is there a multi-cluster aggregate view, or a way of understanding cost across a fleet of clusters? The answer is yes. That kind of relates to another question in chat, which is: are you using the enterprise edition of Kubecost in this environment? The answer is no, I was using the community edition. As Webb pointed out, in the enterprise edition multi-cluster aggregates are supported. You can either install Thanos, which is a Prometheus durable storage endpoint that also does aggregation; or, if you've already got a Thanos installation, we can plug into that; or, for basically anything that speaks PromQL and already does aggregation, we can plug into that too. So in our enterprise edition you can install Thanos or another multi-cluster aggregation tool.
A
Yeah, so just to recap on that: like Ajay mentioned, he was using the free community version. And to add a little bit more to Pratik's question: all of the features we covered are free, built on the open source. The enterprise editions would give you the feature Jesse asked about, multi-cluster aggregation, but also long-term metric retention and common enterprise functionality like RBAC, SAML, et cetera. And Jesse, to add a little bit more to what Ajay shared there: using solutions like Thanos or Cortex, or even a hosted Prometheus, is in our view a really nice way of doing multi-cluster aggregation, because you don't have to have cluster-to-cluster communication and worry about firewalls or anything like that.
A
There are a number of other ways to do multi-cluster aggregation, but that's generally the recommended path, and we're just finding that more and more teams already have a Cortex or a Thanos or a federated Prometheus, et cetera, where they're already sharing data across different environments.
B
We've got one about how we install this.
A
The question is: do you have the code for the Spinnaker pipeline in GitHub, or can you share it, please?
B
Yeah, we've got a setup for how to build a sample custom webhook like this. We're actually showing a development version of Kubecost here, so if you want to get that dev version, it's not yet in our mainline. If you just head over to our Slack channel or email the team, we can get you a build with the API that we use in this Spinnaker pipeline; it basically simplifies another existing API.
A
And to add a little bit more there: it's in our nightly build, and we're bringing it to production very soon. We're actually going to write a blog post and share a lot more about the backing architecture, and share this code. The code lives in the open source Kubecost cost model, so it's really about joining Armory and/or Spinnaker with that open source Kubecost cost model. So reach out to us.
A
You can reach us at team@kubecost.com, or on our Slack channel, if you want to learn more. And again, we're almost at time, but I think we've answered all the questions. So thank you, everybody, for joining today, and thanks for the great questions. Hopefully this was helpful; again, reach out at any point if anyone does have questions. We're going to share more and more content on this, and this is really just the starting point for the Kubecost and Spinnaker integrations.
A
We've got a number of other APIs and insights that we plan on integrating, so if you do have a particular use case in mind that you want to see, we're really interested, because again we're going to be doing more engineering work here that's hopefully interesting to everyone.
A
Oh, we've got one last question here from Phil: what are example improvements in dollars or spend? Phil, I'm not sure if you're able to provide any more context there, but I'll try to answer it as best I can. We regularly see teams reduce spend by 30-plus percent by going through this exercise and leveraging other insights in Kubecost, and in our experience we've seen that be well above 50 percent. This can be a combination of right-sizing workloads like we've shown, identifying abandoned workloads, applying autoscaling, right-sizing clusters, et cetera. We've got about 15 to 20 different insights in the Kubecost product, all in the community version, that are available depending on your context and how you've configured your cluster and all the workloads in it.