From YouTube: Kubernetes Office Hours 20200617 (EU Edition)
Description
Office Hours is a live stream where we answer live questions about Kubernetes from users on the YouTube channel. Office hours are a regularly scheduled meeting where people can bring topics to discuss with the greater community. They are great for answering questions, getting feedback on how you’re using Kubernetes, or to just passively learn by following along.
For more info: https://github.com/kubernetes/community/blob/master/events/office-hours.md
A
The great white north! What do you say we get started? Welcome, everybody. It is the third Wednesday of every month, and that means it's time for the Kubernetes Office Hours. If you're in the channel listening, please let us know how the audio sounds; we always like to make sure that we sound and look good. I'm Jorge Castro, I'm your host. Let's start with some introductions: let's go Chris, Mario, Pavel, Oz, Dave, and Marco. As the new person, you can go last.
B
Hopefully I won't trip over my tongue here. My name is Chris, I'm a customer engineer with Google Cloud's public sector up here in Canada. My background has been primarily running on-prem Kubernetes installations until I joined Google, so hopefully I'll be able to impart some of that knowledge on you, and I'm looking forward to helping any way I can today.
C
Hi everybody, my name is Mario Lauria. I am a senior SRE for StockX in Detroit, Michigan. We are a retail e-commerce platform. I own our Kubernetes infrastructure on EKS, as well as a lot of developer-centric things like CI/CD, Helm, and other things like that. So my niche really lies in the EKS parts: configuring and operating clusters, lifecycle management, and helping developers understand horizontal pod autoscaling and things like that.
D
And I'm Marko Chappie, the director of engineering at Vapor IO, which is a small startup that's building edge colocation data centers co-located on cell tower sites. We leverage Kubernetes throughout our infrastructure, both in the cloud as well as inside each data center for control plane services.
A
And I'm your host today, Jorge Castro. I started this program to learn about Kubernetes, and I still am; I figured we would do it as a community program together. I'm a community manager at VMware. So we're gonna go over how this works; feel free.
A
Here we go. So before we begin, let's start by introducing... oh, we did that. Here are some ground rules. This is a Kubernetes event, so the code of conduct is in effect, so please be excellent to each other; we've never really had an issue with that. This is also a judgment-free zone. Everyone had to start from somewhere, so please help out everybody. There are no dumb questions, and all skill levels are welcome to participate.
A
So,
let's,
let's
try
to
be
positive
there
when
it
comes
to
all
of
our
levels
of
expertise,
and
while
we
will
do
our
best
to
answer
your
question
as
a
panel
doesn't
have
access
to
your
cluster,
so
live
debugging
is
off
topic.
We
can't
like
really
ssh
to
your
stuff
and,
like
fix
your
control
plane
that
kind
of
thing,
but
what
we
can
do
is
try
to
help
you
get
into
the
right
mindset
or
where
to
look
to
find
your
problem,
so
you
can
at
least
unblock
whatever
thing.
A
You're
stuck
on
and
at
least
hopefully
give
you
a
direction
on
where
to
go.
So
you
can
like
progress
in
fixing
your
problem
here,
panelist
you're
encouraged
to
expand
on
your
answers
with
your
experiences
and
pro
tips.
Part
of
the
reason
we
asked
you
to
come
on
is
because
of
your
expertise
and
your
production
experience
things
like
that.
I'm
audience.
This
is
a
participatory
sport,
so
you
can
help
by
pacing
in
URL,
so
the
official
Doc's
blogs
or
anything
that
might
be
relevant
to
the
topic
at
hand.
A
I
know
that
we
have
a
lot
of
experience
out
in
the
audience
as
well.
So
please
feel
free
to
you
know,
help
out
by
just
typing
in
the
chat.
I
know
every
time,
if
someone's
asking
for
tools
or
something
there's,
always
a
new
github
link
that
someone
drops
and
we
learn
about
a
new
tool
and
things
like
that
and
then
what
I
do
at
the
end
of
the
show.
I
have
all
the
URLs
and
I
whack
them
into
the
show
notes.
A
So
we
have
some
reference
material
for
those
of
you
who
can't
come
live
feel
free
to
post
your
questions
just
directly
in
chat
just
but
question
colon
or
something
that
makes
it
obvious.
So
we
can
see
it
and
then
what
we
do
we'll
stick
that
in
our
notes
and
then
we'll
get
to
the
questions
in
the
order
that
we
received
them.
If
you
have
a
premade
question
from
discussed,
kubernetes
dot,
IO
or
Stack
Overflow,
you
can
just
link
the
pace
to
pace
the
link
to
that.
A
So
we
could
read
that,
so
you
don't
have
to
rewrite
your
question.
If
you
already
have
it,
you
can
also
help
us
by
tweeting
spreading
the
word
paying
it
forward.
Anything
that
might
help
people
who
are
using
kubernetes,
you
know
be
exposed
to
this
content
will
be
appreciated
and,
as
always,
if
you
stick
around
to
the
end,
we
do
a
raffle
where
we
give
away
a
snazzy,
kubernetes
t-shirt
which
I'm
wearing
today
it's
one
of
these.
The
way
it
works
is
we'll
pick
two.
A
If
we've
addressed
your
question
on
the
air,
all
you
have
to
do
is
ask
it.
We
will
raffle
out
two
t-shirts,
so
we're
giving
out
two
today
and
two
this
afternoon.
There
is
a
West
Coast
session
in
about
two
hours
after
this
livestream
that
will
cover
that
part
of
the
world.
So
this
is
the
EU
session.
So
with
that
how's
everyone
feeling
today,
let
me
check
out
the
notes.
Joe
is
here:
Konstantinos
is
here
awesome
Ahri's,
here's
got
a
question.
I
see
it
Vishnu.
A
thanks for your question. It looks like Demetri has a question too. Okay, let's get started. Marko, I want you to kick it off real quick, because we ask for questions and you pasted an entire paragraph. I know a lot of it is background information, but we do get questions like this, so let's just spend a minute to talk about it. So what's your question?

Oh yeah, free stuff. After I pressed Enter in the Slack I was like, you know what, I probably should have asked this elsewhere, but yeah. We run...
A
We run a bunch of edge data centers. They're unmanned units, and we have a bunch of interesting hardware that we run there that taps into all of our OT systems, like power management and all that stuff. So it's this crazy hardware platform that is basically a motherboard, some RAM, some CPU, and some disk, and we have like six to ten of those. We run Kubernetes on it, and it works really well for us.
A
Kubernetes helps to keep our 400-plus pods that do all the control plane stuff running there, but we have two pods in particular that need storage. One's a StatefulSet, it's a Postgres database, and one's a Deployment, it's a Prometheus database. The problem we have is that these hardware nodes will power down occasionally, and they have an unreliable BMC, so I can't reboot them remotely, there's no IPMI; a tech has to go and reboot them.
A
The problem we've encountered is when it comes to persistent storage, whether it's a pod or a StatefulSet: when that node goes away, Kubernetes thinks that the storage is still attached. I've read through the code, and in the comments this is expected behavior, because Kubernetes doesn't know when that node will return, so it doesn't want to remove the storage and in effect deadlock, or do terrible things like corrupting file systems.
A
So our problem is, we need these StatefulSets to be running more often than not. We don't care as much about the persistence of that storage, the validity or the safeguards there, because we know the node is powered down; there's nothing mounted there anymore. We've found a way to have Kubernetes short-circuit and remount volumes by deleting the node from our cluster. The biggest problem that we have is that even when we do that, Kubernetes will still sometimes say no, no,
A
this volume is attached somewhere else, it's already attached to a pod, I'm not gonna reallocate it, until we do a bunch of forced deletions and some other things from GitHub threads and issues, which have gotten us to the point where we can somewhat reliably, very manually, recover storage. My question to the panel, and in particular to the community, is: have people encountered this? I know cloud providers are very good about it: when a node goes off, the volumes are moved, because of that really, really good integration with those cloud providers.
A
Remapping
storage,
this
stuff
is
all
possible
and
most
saw
providers
provide
readwrite
many
where
we
have
rewrite
single
or
rewrite
one
in
ours.
What's
the
best
way
for
us
to
kind
of
move
forward
until
we
can
retrofit
our
hardware
to
be
more
reliable
with
a
reliable
out-of-band
controller,
how
can
we
make
sure
that
these
workloads
come
back
online
and
storage
attaches
appropriately
in
a
reasonable
time,
other
than
kind
of
codifying
the
hacks
that
we've
done
today
to
recover
the
cluster.
A
We are using Rook and Ceph. We've used other mediums, and we settled on Ceph because we have Ceph experience, but any storage driver that is ReadWriteOnce has the same effect, where Kubernetes won't unmount or remount the volume storage because it sees it as being mounted elsewhere. I'm not sure if it's something with the CSI driver for Ceph or with the other ones we've used.
A
But
when
we
use
something
like
NFS
and
read/write
many,
it's
not
a
problem
because
we
can
explicitly
mount
multiple
pods
to
a
single
data
stored.
So
we
haven't
found,
and
maybe
the
answer
is:
here's
a
better
storage
provider
but
other
than
like
NFS
and
a
few
others.
We
haven't
found
as
equally
performance
of
storage
compared
to
SEFs
our
bb's
to
provide
us
with
the
kind
of
we
do
a
lot
of
it's
like
a
lot
of
writes
on
these
devices.
A
D
D
Because
Amazon
and
we
basically
we
can
shut
down
the
node,
then
we
can
still
like
Amazon
EBS
actually
does
this,
and
it's
still
like
that
pod
would
actually
unmount
and
go
to
another
node
and
actually
mount
there.
So
I
think
this
is
more
like
a
problem
with
diversity.
Acai
drivers
are
yeah,
it's
a
fork
in
particular.
D
One
of
my
suggestions
is:
you
can
actually
just
run
things
on
local
modes,
for
example
for
Prometheus.
You
definitely
the
one
that
be
running
this
son,
rook
or
itself,
because
you
can
just
have
like
move
the
pope
prometheus
replicas,
each
writing
in
into
local
disk,
and
then
basically,
if
one
note
goes
down,
you
still
have
another
replica.
We
just
constantly
step
scraping
for
metrics
the
problem.
A
That would definitely eliminate that issue for us. We still have a Postgres database that's critical to our flow of data, from our devices to our Prometheus, which is still a stickler. But I suppose another thing we could do is similarly just run Postgres in a clustered configuration, removing the need for block storage directly.
A
The
controller
manager
and
storage,
but
it
removes
our
need
to
worry
about
it
for
now,
which
is
a
good
stepping
stone
for
us,
because
we
have
a
little
ways
to
go
before
we
get
better
hardware
in
these
sites.
So
the
real
problem
is
the
hardware
we're
just
yeah
attacking
it
from
a
software
perspective,
yeah
all
right,
well,
mark
I,
hope
that
gets.
You
started
in
the
meantime
feel
free
to
hang
out
on
the
panel.
In
the
background,
if
there's
there's
a
question,
you
want
to
help
answer.
I
feel
free,
I,
know
you're
babysitting
today,
alright.
A
My
question
is:
what
about
disk
I/o
and
networking
the
underlying
node
storage
and
networking
will
have
a
real
limit
to
IAP
set
cetera?
Are
there
plans
to
get
kubernetes
to
allow
positive
defined
storage
network
requirements
and
not
over
scheduled
storage
and
networking
requirements
on
nodes
which
cannot
handle
their
requests
panel.
D
So I remember, basically, Kubernetes has an annotation which allows you to limit network throughput, but it doesn't allow you to request it or that kind of thing. So if your CNI plugin supports it, you currently can limit one pod to say: please use up to 10 megabytes per second, or something like that. So that's already basically an annotation in Kubernetes. I don't have anything beyond that.
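The annotation being described here is read by CNI chains that include the bandwidth plugin; a minimal sketch (the pod name is made up), keeping in mind these are caps the scheduler ignores, not schedulable requests:

```yaml
# Traffic-shaping annotations honored by the CNI bandwidth plugin,
# if the cluster's CNI configuration includes it. These throttle the
# pod's traffic; they do not influence scheduling decisions.
apiVersion: v1
kind: Pod
metadata:
  name: throttled-pod              # hypothetical name
  annotations:
    kubernetes.io/ingress-bandwidth: 10M
    kubernetes.io/egress-bandwidth: 10M
spec:
  containers:
  - name: app
    image: nginx
```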
A
Okay, moving on. Demetri (hope I got your name right) says: good morning, I have a question about cluster utilization. I am setting resource requests and limits for our services, and I was curious how to approach these. Should I set requests based on peak load, or should I allow limits to handle those? My current request setup leaves my cluster at less than 20 percent utilization, and I was wondering how I should approach increasing my utilization. Interesting question.
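For context, the requests and limits being discussed live in the container spec; a sketch (names and numbers are illustrative, not recommendations):

```yaml
# Requests drive scheduling (and HPA percentages); limits cap usage.
# Sizing requests at peak load reserves capacity that mostly sits
# idle, which is one way a cluster ends up ~20% utilized.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service            # hypothetical
spec:
  replicas: 2
  selector:
    matchLabels: {app: example-service}
  template:
    metadata:
      labels: {app: example-service}
    spec:
      containers:
      - name: app
        image: example/app:1.0     # hypothetical image
        resources:
          requests:
            cpu: 100m              # typical load, not peak
            memory: 128Mi
          limits:
            cpu: 500m              # headroom for bursts
            memory: 256Mi
```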
C
Testing, can you guys hear me? Yep, awesome, cool. I would love to respond to this one; we're going through the same thing right now. Resource requests and limits are incredibly hard to really understand without actually analyzing and spending a lot of time effectively looking at your workload over time, especially from a development perspective. Asking developers to do this is a little bit hard; they really have other things,
other priorities. There are some tools out there that help with this, one of which I think we mentioned before, called Goldilocks, from the Fairwinds ops team. Really, that just provides a fancy UI on top of the Vertical Pod Autoscaler, which sits and observes and then makes recommendations for possible resource requests that you could set.
There's also the QoS class that is Guaranteed, where you actually have your request and limit exactly the same, and that basically gives your application, if it does need to inflate to those resources, a reservation that's already guaranteed. That helps with your nodes having issues as well, when you're trying to oversubscribe the allocation that you will have for your nodes. We've run into all of those sorts of issues, including working with developers to help get this right.
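The Guaranteed QoS class mentioned here falls out of setting requests equal to limits for every container; a minimal sketch:

```yaml
# A pod is assigned the Guaranteed QoS class only when every
# container sets CPU and memory limits, with requests equal to
# those limits. It is evicted last under node pressure.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example         # hypothetical
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:                      # identical to requests => Guaranteed
        cpu: 250m
        memory: 256Mi
```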
C
A lot of it's just trying, and I think the big thing that I tell people is: have something set, especially a limit, especially a request, even if it's a little low. The other thing too, and this is, I believe... I know there are startup probes; I think startup resources might be a thing, or a KEP, I'm not sure. But we have issues with some of our Node apps where they start up and take over a CPU core, then come back down to a normal state within 30 seconds, and they're back to 200 millicores. Put a limit on that? Because what's the point of a limit that's 1.5 CPUs, right? That just doesn't make sense.
C
So
there
again,
this
is
not
perfect.
I
would
say,
look
into
Goldilocks
see
if
you
can
get
you
at
least
a
UX
perspective
of
some
suggested
values
and
really
look
at
your
your
critical
workloads,
the
the
bigger
workloads
on
your
cluster,
especially
daemons
that
since
April
sets
as
well.
So
we
just
had
an
issue
with
data
value
agents
that
we're
just
kept
inflating
kept
inflating,
no
limits
on
them.
E
I would just reiterate what Mario said: resource management is something I see a lot of people struggle with, because it has a lot of impact on other things, like if you're using the horizontal pod autoscaler, how do requests work with that? So it's something you do want to invest a lot of time in, really understanding resource management in Kubernetes,
just because I see a lot of people struggle with that. There are a lot of little things you kind of have to know about it, like what Mario said about the different classes, Burstable, Guaranteed, those types of things. You just really have to invest time to try to understand it as best as you can. You'll never have it perfect on day one, or even day 365, but just invest that time to try to understand it.
C
It's kind of a waste to even define it. I mean, you could still do it, definitely, and if the application still inflated and kept going and going... but really at that point you're just setting it for the sake of setting it; you're not providing much value. And it's so high that, you know, if it's at 200 millicores and it jumps to 1.5 CPUs, that's kind of taxing that node a little bit, depending on whether you set your allocatable for the node as well.
A
D
Yeah
also
like
where
I
work,
we
also
struggle
a
lot
with
resource
management
like
a
lot
of
teams.
Just
over
provisioned
resources
are
under
provision.
It's
just
a
super
complex
problem.
So
right
now
we
have
started
looking
into
a
vertical
paddle
together.
Basically,
that's
software
you
can
run
and
which
would
basically
automatically
pick
correct
resources.
It
is
ours
request
for
you,
so
yeah
I'm
right
now
investigating
this
approach.
Maybe
it
will
work.
I
I
really
hope
it
will
I
know.
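The Vertical Pod Autoscaler being investigated here is configured per workload with a `VerticalPodAutoscaler` object (the VPA components have to be installed in the cluster separately; the target name is made up):

```yaml
# VPA watches actual usage and, in "Auto" mode, evicts pods so they
# are re-created with the recommended requests. "Off" only surfaces
# recommendations, which is the mode Goldilocks builds its UI on.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa                # hypothetical
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service          # hypothetical workload
  updatePolicy:
    updateMode: "Off"              # recommend only; "Auto" applies them
```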
A
All right, the questions keep coming; keep on asking them. If I've missed your question, please just let me know in Slack. The next question comes from Joel Davis, who says: is there anything in RBAC that allows you to select against certain namespaces? For example, if I want to give someone the ability to create new namespaces, of which they have access to star verbs on star resources, but not interact with a given set of namespaces. Is that possible to do through RBAC?
A
Let
me
see
the
replies.
People
are
asking
c10
wants
to
point
out
that
over
chip
does
this
using
an
operator
that
creates
Auerbach
resources
on
the
creation
of
a
new
namespace,
and
it
has
a
link
to
that
which
I
will
put
in
the
thing
here
into
the
main
channel
any
option.
Any
ideas
on
this
max
guy
says
OPA
might
help
you.
A
A
A
A
Wow, we got stumped. All right, let's keep this one open here. Oh, Joel has some information. He says there's nothing built in, because you have to bind the permissions within the namespace after creating it; a webhook or operator is about the only solution. So I think, if you're not using OpenShift, is there a non-OpenShift version of an operator like this?
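The constraint Joel describes comes from RBAC being purely additive and namespace-scoped: a ClusterRole can grant namespace creation cluster-wide, but full access inside a namespace needs a RoleBinding in that namespace, and there is no "all namespaces except these" selector. That is why an operator or webhook has to stamp bindings into each new namespace. A sketch of the two pieces (names are made up):

```yaml
# Cluster-wide: allow creating namespaces.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: namespace-creator          # hypothetical
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["create"]
---
# Per-namespace: full access, but only where this RoleBinding exists.
# An operator would create one of these in each newly made namespace,
# skipping the namespaces the user must not touch.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-admin                 # hypothetical
  namespace: team-a                # hypothetical namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin                      # built-in aggregated role
subjects:
- kind: User
  name: jane                       # hypothetical user
```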
So it sounds like an operator is gonna be the way to go there. Hopefully that helps you out, Joel; feel free to post follow-up questions. A lot of people are responding to each question in threads, so please keep that up. Joel Speed (welcome back) says: something that's come up at my work recently, when hosting metrics endpoints in an application: should people require authn and authz to access them? I.e., should something like kube-rbac-proxy be put in front, or should the functionality be implemented into the application?
D
So I believe right now the Prometheus folks are actually working on this. For example, node_exporter recently added, I think, basic auth or something like that for the metrics endpoint, so I believe the client libraries will also have a similar feature if they don't already. So basically, I don't think you really need proxies, because it will be supported natively. I'll try to find some good links. Sure.
B
The other thing: if it's for storage, on the PVCs, your storage provider might have a separate solution to back those up individually. For cluster state, there's Velero; I've heard nothing but amazing things about it. Or if you're following GitOps here, you have your state in Git, but for storage, yeah, that might be something to look to the provider for.
A
All
right
next
question
comes
from
Andre,
says
high
volume
question
on
the
volumes
documentation
page
we
can
read.
Kubernetes
supports
several
types
of
volumes,
generic.
What
does
it
mean?
Kubernetes
supports
particular
seems,
glossary,
FS
client
comes
with
height
width,
hypercube
kubernetes
node
also
seems
hypercube
is
going
to
be
deprecated.
How
about
kubernetes
continue
to
support?
A
Okay, so types of volumes: the usual ones, right? Like Azure Disk, CephFS, Cinder, you know, Portworx volumes. So the first question is what you mean by "support". And also, I'm gonna add my own little thing here: how does this differ from... I thought everything just supported CSI, and then you would just get that, right? Or is it just a native support thing? I think I've confused myself. Pavel, you wanna take it?
D
Yeah, so basically, one of those types of volumes is CSI. Right now Kubernetes has a bunch of integrated clients, for example AWS EBS or Azure Disks. So generally, "supports" means that the kubelet, the Kubernetes node agent, can actually connect to that type of volume and mount it to your pod, right? So yeah.
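With CSI, that kubelet-side integration moves behind a standard interface: you reference a CSI driver through a StorageClass, and the kubelet calls the driver's node plugin instead of in-tree code. A sketch using the AWS EBS CSI driver as an example (the class and claim names are made up):

```yaml
# StorageClass pointing at an out-of-tree CSI driver...
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi                    # hypothetical name
provisioner: ebs.csi.aws.com       # the AWS EBS CSI driver
volumeBindingMode: WaitForFirstConsumer
---
# ...consumed by an ordinary PVC; pods reference the PVC and never
# need to know which driver sits underneath.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data                       # hypothetical
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-csi
  resources:
    requests:
      storage: 10Gi
```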
A
Okay, so it seems the GlusterFS client kind of comes with hyperkube, the Kubernetes node image. Let's look at the link that they sent here. I'm confused; I'm totally confused here, just kind of showing my lack of knowledge in this area. I thought everything just talked through CSI, and that nothing really talked directly to the storage; I thought this was being moved out of core.
It gives more configurability. Is it safe to say that probably most current and future development is going towards the CSI side of things, and not these, yeah, what do people call these native drivers, what's the name? There's a page on the CSI site that has the supported drivers, and there is like an order of magnitude more supported drivers
A
Then,
in
like
the
traditional
flex,
volume
entry
supported
driver,
so
yeah
there's
definitely
a
lot
of
concerted
efforts.
I
think
storage,
vendors
that
want
to
be
a
part
of
kubernetes
no
CSI
is
the
way
to
put
their
driver
abstraction
into
Kate.
So
a
lot
of
development
is
going
there
if
it
hasn't
already
reached
they're
relatively
mature
state
yeah
I
was
just
I
was
just
surprised
when
I
when
I.
You
know,
when
I
read
this
I
thought
everything
had
moved
to
CSI
by
now,
but.
E
Yeah, I would say it's going to be very dependent on your storage provider and whether their CSI driver is stable or not. Just for example, in Azure we have a CSI driver, but it's maturing and getting to a stable state; when that's supported, that's when you would use CSI. So it depends a lot on your storage provider, whether you're using a cloud-hosted storage provider or a different type of storage, and whether it supports CSI.
D
Just quickly after that: some of the in-tree drivers are actually a bit more resilient than CSI drivers. At least, I had some experience with CephFS and basically FUSE types of drivers, where you just kill one driver pod on a node and then a bunch of pods actually lose connection to their data. So it really depends on the technology you're using, yeah.
C
I cannot answer that, because I'm using cloud, where they manage it for me. I would say everything, especially critical services, should have limits, and you should have monitoring in place so that you get warnings when things are getting hot, 70, 80 percent, something like that, right? Those are operational things that you should just be on top of. But they should have limits, because they can again impact other workloads, and there's a problem there as well. So, better handled sooner rather than later.
A
Anybody
else
yeah
that
seemed
pretty
straightforward,
alright
and
Ray
I
hope
that
helps
you
out.
Let
us
know
how
you
get
on
with
that.
Can
Abbott
asks
hi.
Is
there
an
official
slash
recommended
graph
on
a
template
for
monitoring
the
entire
kubernetes
ecosystem
metrics
on
API
server,
cubelets
scheduler?
What
other
tools
are
people
to
get
a
big
picture
of
it?
That's
the
I
want
a
really
cool
dashboard.
What
do
I
use
so
yeah.
A
That's awesome, sure. I didn't even know that; I didn't even know kubernetes-monitoring was like a namespace, so I've got to dig through there. That's always a good one, awesome. So let us know how that goes, and everybody, thank that repo by giving it a star when you get to it, if you use it. Next, Mahir Sha asks how to set custom metrics for HPA. Also, whoever answers this question, tell me what HPA is.
C
Yes, the horizontal pod autoscaler, another top-level API object. What it does is monitor, across your instances, the average usage of either CPU or memory, or other metrics as well. You set thresholds on it, and it measures usage against them; specifically for CPU and memory, against what you requested. So going back to those resource requests we talked about: let's say you set 100 millicores, and your application, the average of all the instances that are part of that Deployment, is up to 80 percent
of that, and you have an 80 percent threshold set on your HPA; then the HPA will start to take action. That's either scaling in or scaling out, depending: either killing instances that aren't needed, or adding instances to meet the threshold that you set. So the question here is specifically around external metrics or custom metrics. Usually what happens is there's a controller that can provide those for you.
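The CPU-threshold behavior described above, plus a custom per-pod metric served by such a controller, looks roughly like this in the autoscaling/v2 API (autoscaling/v2beta2 on clusters from the era of this recording; the workload and metric names are made up and assume a metrics adapter is exposing the custom metric):

```yaml
# HPA comparing average CPU (as a % of the container *request*) and
# a custom per-pod metric against targets, scaling the Deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa                # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service          # hypothetical workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80     # % of requested CPU
  - type: Pods
    pods:
      metric:
        name: requests_per_second  # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "100"
```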
C
So, for instance, we use Datadog, and they have a cluster agent which can provide custom metrics, and that cluster agent taps into all the metrics that the agents are collecting. So basically, anything that we have that gets reported, any metric at all that's available there (whether it's out of the box or we're reporting it up through a service, etc.), we can set thresholds on and have HPA scale on that, and that makes things super, super nice.
C
We haven't done a lot of the custom metrics stuff yet; we've really honed in on CPU and memory. But again, this is gonna be a rabbit hole if you know some of your key metrics. So maybe if error rate goes up you want to add some instances, but more so latency and other sensitive metrics around your application; you can do that. You should also look at things like monitoring and alerts for those as well.
C
It's
not
a
hundred
percent,
so
also
I
also
want
to
point
out
I'll
link
it
here
in
the
channel
there's
another
project,
that's
in
the
CN
CN
CF
health
kita
and
that
actually
taps
into
other
constructs
like,
for
instance,
an
alias
s.
Qsq
size
can
be
something
that
you
you
set
a
threshold
on
and
say:
okay
well
for
over
ten
thousand
entries
and
ten
thousand
cute
entries,
then
you
know
we
need
to
scale
up
or
something
like
that,
so
that
that
makes
things
a
lot
easier
and
I
think
this
is.
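The SQS-driven scaling mentioned here is what KEDA expresses with a ScaledObject; a sketch (the queue URL and workload name are made up):

```yaml
# KEDA watches the queue and drives an HPA for the target workload;
# replicas grow as the backlog exceeds queueLength per replica.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler          # hypothetical
spec:
  scaleTargetRef:
    name: sqs-worker               # hypothetical Deployment
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/0000/example  # hypothetical
      queueLength: "100"           # target messages per replica
      awsRegion: us-east-1
```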
A
Alright,
we
got
about
15
minutes
left
and
about
two
or
three
questions
in
the
queue
so
keep
on
asking
audience.
If
you
have
questions,
meanwhile,
we
will
get
to
the
next
question
which
Mario
answered
with
a
bunch
of
links
and
Mario
I'm
gonna
ask
you
to
take
those
links
and
toss
them
in
the
main
channel
here,
but
Jojo
Perez.
Is
there
a
tool
for
testing
kubernetes
service?
Latency
Mario
responded
with
a
bunch
of
links,
but
let
me
give
the
pant
the
other
panelists
a
chance
to
feel
this,
that
they
they
wish.
C
I'm Mario, he's Marko, I know, good one. And pasted... I can't seem to find that thread right now. Yeah, I got 'em.
C
Actually, yeah, I was gonna say: I mentioned before that we were kind of in service mesh research mode, and me and my coworker put together a Google Doc, just vomited a bunch of things from our brains into it. I had starred GitHub repos and all that, and these kind of were the frontrunners when it came to performance testing. Specifically, I
C
Think
the
big
ones
for
me
are
the
blue,
shot
the
Kate,
CNI
benchmark
and
probably
ripsaw
in
terms
of
I,
want
to
see
both
like
service
latency
latency
is
out
of
the
cluster
in
the
cluster
services
service.
Things
like
that
those
are
the
sorts
of
metrics
that
we
were
gonna,
try
to
kind
of
observe
when
it
came
to
you
know,
certainty
and
eyes,
ie,
link
or
de
or
something
used,
convoy,
etc.
So,
there's
no
like
perfect.
It
really
depends
on
your
needs
somewhere
before
declarative
than
others.
C
I
really
like
what
still
has
been
doing
as
well.
Ks
smashes
again
just
and
everyone
knows
the
chaos
stuff
that
came
urging
kind
of
on
Netflix
and
others
18
different
solutions
for
that,
so
that
that
does
that
for
a
service
mesh,
artilleries
kind
of
a
go
to
first
over
our
front-end
teams,
as
well
I,
just
from
like
very
outside
I'm
on
my
laptop.
How
is
performance
so
again
be
careful
with
these
tools:
I
don't
DDoS
yourself,
but
yet
they
they
can
they're
pretty
configurable
and
they
provide
a
lot
of
a
lot
of
rope.
C
We were still exploratory. I think we've narrowed it down, mostly to probably not doing Kuma, which was Kong's; we're probably focused more on Linkerd and Consul right now. Istio, you know, we're a small team, we're still kind of a small company, and I think it was just a little bit more complex. I know there's Aspen Mesh, which is really making all of that easier and providing support services for it as well, but I think it feels like a little bit of overhead in terms of what we really need.
A
Caroline, we'll give you some time there to give us a follow-up; feel free to just keep typing. There are some questions that I appear to have missed, so let me go back here while we let that one stew for a minute. Sivan Kenapulli (hope I got that right) says: hi, I have a question. Is there a way to specify certain pods in a Deployment to get killed when the horizontal autoscaler scales down the Deployment? Mario, it looks like you've answered this one, but I wanted to get it on the video. Sure.
C
Yeah, just really quick: the HPA references the Deployment, and everything that's part of that Deployment, every instance, is impacted. So in that case you'd have to make a separate Deployment or something like that. Deployments track pods through labels, so if you wanted to pull pods out, you could remove the labels from those pods, etc. So yeah.
A
Alright
and
another
follow
up
here:
Vishnu
Prasad
ass.
Is
there
a
project
or
tools
that
would
help
us
configure
how
and
when
to
auto
scale
nodes
up
and
down
like
in
eks,
node
groups
went
to
scale
down
the
nodes,
especially
mainly
because
certain
loads
can't
use
them
on
those
metrics
like
cpu
memory
are
all
the
time
scaling
up
and
down.
It
looks
like
you've
been
answering
all
the
questions
in
chat
before
we
get
to
them
so.
C
The
easy
ones
man
I've
lived
in
auto-scaling
yeah,
the
cluster
autoscaler,
is
great
for
that.
It's
it's
I
wouldn't
say
it's
the
the
most
stable,
perfect
production
piece
of
software,
but
it
gets
the
job
done
it
logs,
I
think
as
AI
config
map
you
can
reference
for
status
and
kind
of
it's
looking
to
make
and
I've
never
had
an
issue
with
it
talking
to
a
toes
API
to
change,
ASG
sighs,
which
is
effectively
what
it's
doing
to
scale
the
entirety
of
the
cluster.
C
So
the
automation,
where
it
kind
of
senses,
everything
through
labels
and
tags
and
whatnot
is,
is
really
good
and
the
home
chart
is
great.
We
use
it
right
now
on
all
of
our
monsters.
So
one
thing
is
multi.
A-Z
is
a
little
tricky.
You
might
be
out
of
balance
in
some
cases,
but
I
won't
get
into
that.
So
there's
docks
there.
Okay.
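For reference, the Helm chart praised here is typically configured with auto-discovery against tagged ASGs; a hedged sketch of chart values (the region and cluster name are made up):

```yaml
# values.yaml for the cluster-autoscaler Helm chart. With
# auto-discovery, the autoscaler finds ASGs tagged
#   k8s.io/cluster-autoscaler/enabled
#   k8s.io/cluster-autoscaler/<clusterName>
# and grows or shrinks them to fit pending pods.
cloudProvider: aws
awsRegion: us-east-1               # hypothetical region
autoDiscovery:
  clusterName: example-cluster     # hypothetical cluster name
extraArgs:
  balance-similar-node-groups: true  # helps with the multi-AZ case
```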
A
So,
let's,
let's
circle
back
to
Caroline's
question:
if
she,
if
they
weren't
listening,
given
a
currently
update
in
deployment
rolling
in
a
new
replica
set
and
rolling
out
the
previous
one
which
replica
set
up
odds,
are
taken
away
from
if
a
user
scalar
decreases
I
think
we
asked
her
follow-up
info
on
this
one
right,
yeah.
C
We
yester
problem
I
actually
am
looking
at
our
live
production
cluster
right
now
and
I
actually
went
to
the
real
flickers
that
view
in
my
k-9s
interface
and
I
only
see
for
any
given
deployment.
Let's
call
it
edge
platform,
it's
got
I
see
like
10
replicas
sets
here,
and
all
of
them
are
zeroed
out,
except
for
one
which
actually
is
active
and
has
the
active
number
of
instances.
So
in
that
case,
I
think
that
HPA,
let's
say
that's
editing
your
deployment.
C
The change in the number of replicas should probably be applied in just that ReplicaSet, because you're not draining that ReplicaSet entirely; you're not doing a deployment or anything like that, so there's no need for it to be killed at all. That's my understanding, so maybe there are better docs or something around that. Okay.
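The bookkeeping described above can be sketched with a toy model. This is not the Deployment controller's code, just an illustration of the behavior: each rollout creates a new ReplicaSet, old ones are kept around at zero replicas (up to the default revisionHistoryLimit of 10, which matches the roughly ten zeroed-out ReplicaSets observed in k9s), and a replica change from a user or the HPA only touches the active one.

```python
# Toy model of how a Deployment manages its ReplicaSets.
REVISION_HISTORY_LIMIT = 10  # Kubernetes default

class Deployment:
    def __init__(self, replicas):
        self.replica_sets = []  # oldest first; the last one is active
        self.rollout(replicas)

    def rollout(self, replicas):
        for rs in self.replica_sets:
            rs["replicas"] = 0           # old ReplicaSets are scaled to zero
        self.replica_sets.append({"replicas": replicas})
        # prune zeroed-out ReplicaSets beyond the history limit
        while len(self.replica_sets) - 1 > REVISION_HISTORY_LIMIT:
            self.replica_sets.pop(0)

    def scale(self, replicas):
        # a user or HPA scaling decision only touches the active ReplicaSet
        self.replica_sets[-1]["replicas"] = replicas

dep = Deployment(replicas=3)
for _ in range(20):          # twenty rollouts over the deployment's life
    dep.rollout(3)
dep.scale(5)                 # e.g. an HPA decision
print(len(dep.replica_sets))                             # 11: 10 old + 1 active
print([rs["replicas"] for rs in dep.replica_sets[:-1]])  # all zero
print(dep.replica_sets[-1]["replicas"])                  # 5
```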
A
Awesome, and with that we've reached the end of the queue of questions. So if you have questions, feel free to ask them. We're gonna get to Long's question next, which is the last one, so we probably have time for one or two more, so get them in. metalmatze, welcome. He says he just saw that monitoring mixins are being discussed here; there's even a Slack channel dedicated to them on the Kubernetes Slack. So thanks for that link, that's always useful. Long?
C
I was just going to say, there's a Helm chart for the metrics-server, which is also what I would base it on. To close out the question: I'm not sure what they do there or what the defaults are. It's always worked for us, even from 1.12 through 1.15, where we are right now, so I'm guessing it does. I would look at the templates in there and see if that's an option they pass by default. That would be my two cents.
C
So in our case, we have a Datadog cluster agent that provides an external metrics object, I guess, that has metrics as part of it. So, okay, we can leverage those in the HPA and say: HPA, scale on the number of requests going to each instance; we want it to be 100, so balance it out, kind of thing.
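The requests-per-instance target described above maps onto the scaling rule documented for the HorizontalPodAutoscaler: desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). A minimal sketch of that arithmetic:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """The HPA scaling rule from the Kubernetes docs:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# e.g. 4 instances each seeing 150 requests against a target of 100
print(hpa_desired_replicas(4, 150, 100))  # 6
# once each instance sees ~100 requests, the count is stable
print(hpa_desired_replicas(6, 100, 100))  # 6
```

The metric source (Datadog cluster agent, Prometheus adapter, metrics-server) only changes where currentMetricValue comes from; the replica math is the same.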
C
What is the go-to for Prometheus right now, the standard, kind of ultimate deployment? I know in the Helm official charts there's a Prometheus Operator chart; I think that seems to include pretty much everything that you would need, including the node exporter, Alertmanager, kube-state-metrics, Grafana, etc. Is that kind of still the go-to, or is anyone using anything else?
A
We
use-
oh,
my
goodness.
We
used
two
Prometheus
operated,
mom
charts
that
we
also
wrote
a
secondary
helm
chart
that
we
use
internally
that
standardizes
our
Prometheus
CRD
definition,
so
that
we
have
a
pretty
consistent
setup
of
what
we
do
for
Prometheus,
whether
it's
in
our
cloud
or
hedge
sites,
but.
A
So they say kube-prometheus is the standard, and the Helm chart is based on that as well. Yeah, it looks like metalmatze is a maintainer and works on Prometheus, so yes, confirmation from a maintainer gives it extra weight. He says: yeah, we're actively maintaining that repository, lots of things going on daily. So thanks very much for that; that helps.
A
I think the Prometheus Operator should be a good one. That's what I was writing down; that's why I wasn't paying attention. I was like, we need to have a session on this, so good to know. All right, so we are really running out of time. I really appreciate everyone for showing up, especially those of you sharing your recommendations in chat. I'm gonna give away two t-shirts here today. Here's what's gonna happen: I'm going to tell you that you won a t-shirt, and then I will PM you with the store details.
A
You
can
always
get
all
the
goodies
from
the
store
from
the
CN
CF
big
shout
out
to
Google
stock
X,
Microsoft,
VMware
and
Parv.
Where
do
you
work
at
again,
eww
eww
and
vapor
dot
IO
for
letting
their
engineers
sit
on
this
panel.
If
you're
interested
in
sitting
on
this
panel,
we're
always
looking
for
volunteers,
so
we
will
have
to
have
like
a
rotating
set
of
people,
so
we
can
cover
different
levels
of
expertise
and
different
areas
of
the
project
and
with
that
there
is
a
Prometheus
ecosystem
every
call
every
month.
A
If
someone
could
drop
a
link
to
that
in
chat.
I
will
make
sure
that
gets
to
the
show
notes.
The
winners
are
Vishnu
Prasad
you've
won
a
Kira,
Nerys
t-shirt
and
a
Sivan
Ken
pulley.
Thank
you
for
your
questions.
I
will
follow
up
with
you.
We
are
gonna,
go
live
in
another
two
hours.
Geoffrey
Sica
will
be
grabbing
a
bunch
of
West
Coasters
and
we
will
go
live
again
in
this
channel.
So
if
you're
listening-
and
we
do
these
a
third
Wednesday
of
every
month-
it's
always
a
third
Wednesday.
A
We
try
to
have
as
many
sessions
as
possible
if
you're,
interesting
and
helping
out
yeah,
just
let
us
know,
feel
free
to
hang
out
in
the
channel.
We
like
to
keep
it's
like
a
nice.
Much
smaller
group
than
trying
to
you
know,
get
help
in
a
channel
with
a
hundred
thousand
people
panel.
Any
last
thoughts
you.