Description
The heart of the Kubernetes control plane in Red Hat OpenShift is etcd, a key/value store used to persist configuration, status and requested state for everything happening in the cluster. Slow etcd means a slow control plane and slow API, which ripples out to every aspect of OpenShift and your applications. In this episode, we dig into etcd, focusing on performance requirements, protecting the data, and periodic maintenance, so don’t miss this one!
More about etcd:
https://www.redhat.com/en/topics/containers/what-is-etcd
A
Good morning, good afternoon, good evening, wherever you're hailing from. Welcome to another edition of the Ask an OpenShift Admin show here on OpenShift TV. I am Chris Short, executive producer of OpenShift TV, and I am here with the one and only Andrew Sullivan. What are we going to talk about today, Andrew? A special guest and everything!
B
We're doing etcd, or "et-c-d" as it's sometimes called, and first I want to say I'm super excited about that intro animation.
B
Yeah, so thank you, Chris. This is the OpenShifter... excuse me, this is the Ask an OpenShift Admin office hours live stream. The goal of this is to give you, our audience, the opportunity to ask us quite literally anything that's on the top of your mind. We're here to help you work through all of those issues and any outstanding questions that you might have, whether we have an answer on hand or need to go back and find the answer for you.

B
So please, at any point in time, feel free to post into chat across all of the various platforms that we're streaming on to ask us any questions.

B
However, in the absence of those questions (hello, Ali!), we do have a topic, and today's topic is one that I have been super excited about since the day I put it onto the agenda. It's something that pushes all of those good, happy buttons for me, because it's kind of in the weeds, it's kind of super geeky in how it works, and there's a lot of different stuff that goes into it. So with that in mind, I am very happy to introduce our guest, Anand.
C
No problem, Andrew, thanks for the wonderful introduction. My name is Anand Chandramohan, and I'm glad to be on OpenShift TV again. Chris is an awesome host; I think a couple of months back we did a few sessions on Windows, and now I'm glad to be working with Andrew and Chris again for etcd. Just as a background, I'm the product manager for etcd, so I'm responsible for the roadmap planning and future direction of the product.

C
So any questions you may have about the futures or the roadmap, I'm here to answer those right away, or to find out answers that we can come back with.
B
Yeah, and as you pointed out, you were here before to talk about the Windows container side of things. Like basically all of us, you have multiple roles and multiple responsibilities, so for anybody who caught those episodes before, don't be alarmed or surprised: Anand hasn't abandoned any of those. Do we ever get to abandon things? I don't think so; we only get more.

B
It's just another hat that he happens to wear. So, in somewhat unusual fashion: usually I have a loose agenda for these, and everybody knows I like my sticky notes, but for this one we have an actual Word document, or, well, an actual Google Doc.
B
It honestly is. So I really want to devote as much time as possible to today's topic, because I think it's a big topic, it's a complex topic, and there are a lot of questions about it. etcd is a little bit of a black box to a lot of folks. But I do want to quickly cover, in traditional Ask an OpenShift Admin hour tradition (I used "traditional" twice there, didn't I?), just a couple of things that are top of mind, things that I think you all should be aware of that have come up either internally or externally.
B
If we go to the Cincinnati graph data repo, you see I'm just in github.com/openshift/cincinnati-graph-data. Remember, this is the repo that all of that information gets pulled from for updates, upgrades, et cetera. You can see that there's an open PR to enable 4.6.21 in the stable channels, and it usually takes a couple of days after these are submitted. So, like I said, it's on the cusp of being available; that doesn't mean that it will be, that doesn't mean that they still won't.

B
You know, I think all three of us were here back in the 4.4 days. The upgrade from 4.3 to 4.4 in stable took a long, long time, and the engineering and product management teams have done a lot of work to make sure that doesn't happen again, so that we can make those updates happen a lot faster.
B
Agreed. So the second thing that I wanted to bring up real quick, this one has caused a little bit of recurring drama. If I go to the docs here, and I want the 4.7 docs, and I go to installing, and I want to install to vSphere: if I look at the requirements here for installing to vSphere, it's not a stretch to interpret them as saying that the minimum supported vSphere version requires using NSX-T. It's easy to misinterpret that and say, "oh well..."

B
"...in order to deploy OpenShift to VMware, I need NSX-T." So we've got a couple of BZs open with the docs team to basically clarify: it's not required to use NSX-T if you want to deploy OpenShift to vSphere. But if you do choose to use NSX-T, then there are some version requirements associated with that. So just a quick one; this comes up every week or two, I get an email asking, "is NSX-T really required?"
B
Well, no. So just be aware of that. And last but not least, I had a question come into my inbox about the default storage classes created with IPI and UPI.

B
So if you deploy to, for example, vSphere using IPI or UPI, the installer will automatically configure a storage provisioner (the in-tree vSphere storage provisioner, in that case) and will automatically configure a storage class. It's named "thin", it is configured as the default class, and it's pointed at whichever storage domain, or datastore, the VMs sit in.
B
If you do delete it, it will get recreated, and that is done by the cluster storage operator. Go back to GitHub, and if we search for "storage" here, as soon as I spell storage right (still can't spell storage right), here we go: cluster-storage-operator. This cluster storage operator is responsible for making sure that there is always at least the one defined storage class for each platform in the cluster.
B
So you can, for example, change that storage class. If I want to keep the thin storage class, and I want to keep it as the default, I can go in and modify it to point to a different datastore, for example. I can create a new storage class and mark it as default. But that thin storage class, in the case of vSphere (and it's different for each one of the platforms), has to be there, and if I delete it, the operator will recreate it and re-mark it as default.

B
So just be careful if you're doing that, and make sure that you're aware of that behavior. If it seems weird, like "oh, I can't delete it," or "oh, every time I try and set a new default it goes back," well, make sure you're not setting a new default and then deleting the original, which will cause it to recreate. Okay, so I'm going to ask you a question, Anand, and then I'm going to catch up on chat while you answer. I'll also be listening to you, of course.
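For reference, the "mark it as default" step described here is just an annotation on the StorageClass object. A minimal sketch, assuming the vSphere-created "thin" class and a hypothetical "fast" class of your own:

    # Clear the default flag on the installer-created class...
    oc patch storageclass thin -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
    # ...and set it on your own class instead.
    oc patch storageclass fast -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Just don't delete "thin" afterwards, or, as described above, the cluster storage operator will recreate it and re-mark it as default.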
B
So the first question I wanted to ask... great, so coming back to our topic of the day: etcd. What is etcd? What purpose does it serve? How does it function inside of, not just OpenShift or Kubernetes, but in general?
C
Yeah, actually, I'll keep it simple. etcd is basically a key/value store for storing the state of your cluster. So your config maps, your secrets, and other protected resources are stored in etcd, and, like the title of this session says, it's really the heart of the control plane.

C
So that's essentially what it is. It's an open source project, and it's very much a critical part of the control plane. We've built an etcd operator that watches for changes to the cluster and reacts in a certain way, and we'll get into that over the course of this session.
C
Yeah, it's different from a database in the sense that this is not your traditional Oracle RDBMS where you store inventory data or, say, customer or fleet-management data. This is meant purely for storing the internal state of the cluster, and it's not meant for any user-level workloads.

C
Like I said, it's used to store config maps, secrets, routes, and OAuth access and authorization tokens. Those are the bunch of things stored inside etcd.
B
Yeah, usually the way I think about it is: anything that's YAML is inside of etcd. Any time I create some sort of object, it is being stored inside of there, and it's important. Say I create a new pod: I submit it to the API as that YAML definition. The API server then modifies etcd, it adds in that object, and then the scheduler does some work. The scheduler says, "hey, there's a new pod definition, I need to schedule it," and it will update that object with the chosen node, for example, and notify that node, "hey, you need to go and start executing this pod." Then that node will instantiate it, and it's making updates to that object as well: it's adding status information, it's adding in all of these other things.

B
So it's not a database in the sense of row values being stored, but it's critically important to the functionality of OpenShift and Kubernetes. Every object has many different operations that are happening around it, and with it, and against it. So yeah, I think that was accurate.
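You can watch that lifecycle from the CLI. A quick illustrative sketch (the pod name is made up):

    oc run my-nginx --image=nginx                        # API server persists the new Pod object into etcd
    oc get pod my-nginx -o jsonpath='{.spec.nodeName}'   # filled in by the scheduler when it binds the pod
    oc get pod my-nginx -o jsonpath='{.status.phase}'    # status written back by the kubelet as it runs

Each of those fields is a separate write to the same etcd-backed object, which is why etcd write latency matters so much.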
B
I'm looking to you to... no, that's your sport. Okay, so usually these conversations, when we talk about etcd, turn into a performance conversation, and I say that because performance is most frequently the issue that people have with etcd, and it ripples out into all different components, all different aspects of the cluster. So, as I said, if I create that object, say a pod, and it takes too long for that to be committed into the database, that can cause other issues.
B
It takes longer to schedule; once the scheduling happens, it takes longer to actually be instantiated; once it's been instantiated, it takes longer for each one of those operations. And there are hundreds or thousands of operations, depending on the size of your cluster, that could be going on at any point in time.
B
So this is a very... thesecretlivesofdata.com. It's a pretty simplistic explanation (Chris, I see you smiling), but this one is pretty well known for breaking down and understanding the Raft protocol and how it works with data persistence, which is what etcd uses. I'm just going to click continue here to more or less show what it looks like, and actually I think I want to jump ahead.
B
So in this instance, we have the three nodes that we'll go through. They all start as followers. They will elect amongst themselves a leader, based off of whatever criteria happens to be defined, and eventually we end up with that defined, elected leader. So, from a change perspective: let's say this green dot here is the API server. The API server communicates with the leader and says "set some sort of value," so that value goes to the leader, the leader then logs it, and then it sends that data to the followers.
B
The followers then log that data and reply back, "hey, we've got it," and then it goes back and says, "okay, enough of us have agreed that we have this data, now let's commit the data." And at that point the leader will return back to the requester (the API server, in the case of Kubernetes) that "hey, your data has been saved."
B
So the really important part here is: you notice that once the data was sent to etcd, to the leader, it had to get written, it had to go across the network to the followers, who then had to write it, who then had to come back to the leader and say "hey, we got it," who then had to go back and commit it, who then had to come back and say, "okay, I've actually stored the data." So you notice there are a bunch of different operations happening there.
B
So this is why, when we talk about, and in just a moment we can start talking about, the actual recommended values, the latency for storage and network et cetera that we suggest, this is why they seem super low: because there's a lot of that traffic that happens, and it stacks on top of itself. So one request is a bunch of different operations.

B
All right, so I will pause here. Chris Jones [has a question in chat].
B
Yeah. Anand, I'll throw out what I think my answer is, and please add on or correct me if needed. So etcd itself doesn't have other dependencies in Kubernetes (and I'll put an asterisk on that), but it does have infrastructure requirements. Of course you need to have the storage available, you need to have the network available, that type of stuff. The deployment inside of OpenShift really has a dependency on the etcd cluster operator.
B
So this is what I have here; I'll paste this into the chat. I say that because the operator is what's actually configuring the nodes, or the pods, in the cluster. So let me switch over to my terminal here, and if I do an "oc get pod" in the openshift-etcd namespace, we can see we have these three pods up here at the top, one on each one of the control plane nodes, and then I can do a describe against that pod.
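Those two commands, roughly as run on screen (pod and node names will differ per cluster):

    oc get pods -n openshift-etcd                       # one etcd pod per control plane node
    oc describe pod etcd-master-0 -n openshift-etcd     # shows the env vars and startup script the operator rendered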
B
So if I find my (where is it?) etcd pod right here, you can see that it has a lot of values, a lot of data that's being passed to it. All of this configuration data comes from the operator. The operator is the one that says, "hey, this is where you can find your certificate," and importantly, once we get down here into the actual script, we can see it saying, "hey, this is where all of your peers are at; this is where you can find the other nodes inside of the cluster."
B
So while the concept, or the implementation, of etcd itself doesn't require something else in Kubernetes, OpenShift does, and this is why we have the whole bootstrap thing. When we instantiate a cluster, we stand up the bootstrap node, and it creates a single-node cluster that then instantiates the control plane and hands off to that now three-node cluster. That's how we work around the chicken and egg. So I'll be quiet now and let Anand talk.
C
Please, yeah. So I think, Andrew, you are on the money with all of those statements. It has no prereqs; it's a day-one operator, like you mentioned, so it's installed with the cluster as a key part of the control plane. One of the prerequisites, though, for having good etcd performance is a good storage backend, and, as you know, OpenShift is supported on a wide variety of platforms: private clouds, public clouds, edge networks and whatnot.
C
So one of the key things to get good performance from your etcd operator, or from your etcd database, is to make sure that the storage backend backing the cluster is performant enough, and here is where this utility called fio comes in. One of the things that you might want to do before you install your OpenShift cluster, if you're worried that you might run into etcd performance degradation issues later on, is to run fio.
C
If you run fio, it will give you a status of whether your storage backend is good enough or not. And again, even if you're on a cloud platform, say AWS or Azure or GCP, you have access to a wide variety of disks. You have access to SSDs...
C
...ultra HDDs, ultra SSDs, NVMe, and ephemeral storage. You have access to so many types of backend storage, and each of them offers a different type of performance. So one of the things you can do is run this fio command. I'm going to copy this and put it up here. When you run this command, it usually takes a few seconds, and you want to watch for one metric.
C
You want to watch for a metric called fdatasync, and you want to make sure that the 99th percentile of that metric is less than 10 milliseconds. So we let this guy run for a few seconds and we observe. In my case my cluster is, I think, on a cloud platform, and we will see whether this installation is going to be good enough to support good etcd performance.
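The exact command he pastes isn't captured in the transcript, but the fio invocation commonly cited for this check in the etcd community guidance looks like this; point --directory at the disk that will back /var/lib/etcd:

    # ~22 MiB of 2300-byte sequential writes, with an fdatasync after every write, mimicking etcd's WAL
    fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=etcd-check

Then read the fdatasync percentiles out of the output, as he does next.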
C
So that's the output, and these are the different percentile metrics for fdatasync. You can see the 99th percentile is 465 microseconds, so that's around 0.4 milliseconds, which is way less than 10. So we are in pretty good shape here; you're in real good shape.
B
So this is good. I don't want to interrupt you if you're going to lose that thought, but we do have a couple of questions, if that's okay. Let's...
B
So I think the answer is: it uses... well, this is one of those things that the operator does. The operator basically points it to the other cluster members automatically. Before the operator, so in OpenShift 4.3 and earlier, we used DNS SRV records. If you happen to go back in the docs and look, you'll see that during install we required all of these additional records for etcd, SRV records pointing to etcd.
C
Yeah, we have gotten rid of the DNS requirements, but I'm not sure whether that goes through services or ingress, or what type of network protocol it uses to communicate with the other etcd pods.
B
Yes, so if you look at that command that I did earlier, the "oc describe pod" on one of those etcd pods in the openshift-etcd namespace, you'll see there are some environment variables, for example ALL_ETCD_ENDPOINTS, or the per-node NODE_..._ETCD_NAME and NODE_..._ETCD_URL_HOST values.
B
Those are the values that it's using to find those other members. And importantly, I think this, the operator specifically, is one of the reasons we can only tolerate losing, or recovering, a single node of etcd at a time: the operator has to be there to reconfigure the surviving nodes to point to the new node, so that they can all recover.
B
And I don't know that authoritatively. I think it's very much to your point, Walid: I believe that it removes older versions. So essentially, if there's a configuration update, it is responsible for pruning the old pods that happen to be left there.
B
So I'm going to share my screen again real quick to help answer your question about running fio on CoreOS (there's Sam). The reason I'm going to share it is because, if we switch to the documentation here, we have this "recommended etcd practices" page. Let me paste this into the chat here.
B
So, that being said, I actually did create, let me see if I have the link handy, a while ago, a very simple gist that shows how to run this check inside of an OpenShift cluster, but I think it's important to understand that it just schedules it as a regular pod.
B
So with Azure we deliberately tune (and I want to get to these in a few seconds), we deliberately tune some of the etcd settings, as well as use things like, I think it requests a one-terabyte drive for the control plane nodes, to account for all of the IOPS and all the other things that it needs. So the control plane nodes may be substantially different from a worker node.
B
In addition, the worker nodes often come from machine sets, so it's close, but not always the same. Just keep that in mind. And Anand, I think you had something to add.
C
Yeah, sorry, I was replying to a few questions in chat. So a couple of things, again going back to the fio example. From a product perspective, one of the things we want to do is integrate fio as a part of the install process, so that you don't have to run it and then run the install. If we find that most of our customers actually have a need to run this, we may bake it in as a part of our day-one install: maybe when you generate the manifests, or when you generate the install configs, or maybe when you say "cluster create," you want to make sure the fio utility pops up, checks the status of your disks, and gives you a report on whether your backends are good enough or not. If it's not good enough, it's going to splash a warning.
C
We don't want to block the install at this point, but we want you to be at least aware that your fsync is at this percentile level, so you can expect good or bad performance from etcd. So that's one of the things we're looking at: baking fio in as a part of the OpenShift install.
C
That way you're aware, ahead of time, of what kind of performance you're going to be getting. That's point number one. For point number two, let me actually share my screen again.
C
Okay, so now you have determined that your storage is not going to give you good etcd performance. What are the options you have? One is obviously you can upgrade to better storage: you can upgrade to NVMe, maybe you can upgrade to ultra SSDs and whatnot. The second option you have is you can always mount /var/lib/etcd on a separate disk, and we're going to provide documentation.

C
I believe it's already there for most of the cloud providers, AWS, Azure and vSphere, and I think even bare metal, to show you how you can mount an external volume for your /var/lib/etcd on a separate high-performance disk and attach that to your cluster. I'm going to dig that out, so give me a few seconds while Andrew walks through the presentation.

C
I will actually provide documentation on how to mount /var/lib/etcd to a separate secondary disk, and one of the things we're looking at doing from the product side is how you can bake this in as a part of the day-one install. So as you're installing through day one, if you want to use an external secondary disk for etcd, we want to provide you the knob, we want to provide you an option of doing that as a part of day one.
B
Yeah, I know that's something that has been asked about somewhat frequently since OpenShift 4 was released. OpenShift 4 with CoreOS has been a much more rigid configuration, and a lot of people with OpenShift 3 would dedicate a disk to etcd, so I know that is something that is very much being looked forward to. I'm going to share my screen again as well, and the reason why I wanted to do that is because I want to share this KCS.
B
So we can see here: slow disk (the p99 duration should be less than 10 milliseconds) and database-size-related issues. We'll talk in a little bit about what defragging and compaction are and how those can impact performance as well.
B
Overall latency comes from network latency as well as storage latency. So here you can see that we don't provide specific network latency requirements; instead, it's wrapped up in this 50 milliseconds number. So if your storage is slower, then you maybe need a faster network to be able to reach that 50 milliseconds, that type of stuff. It's very much a balancing act.
B
Yeah, and the other one that I wanted to share is this KCS. So I created that super simple gist, and this one is very similar: here they recommend doing an oc debug into the master node to run that same podman command. Nice, I'll post that KCS in there as well. I'm going to save that.
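That KCS flow is roughly the following (the node name is illustrative, and the image is the one the 4.7-era docs referenced for this check):

    oc debug node/master-0
    # inside the debug shell:
    chroot /host
    podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf

The container runs the same fio fdatasync test, but against the disk actually backing etcd, and reports whether the 99th percentile is under the 10 ms target.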
B
Yeah, and let's see, what other links do I have out here? Cluster storage operator: we already talked about that.
B
We already talked about that too. So let me check my notes, and Anand, please interrupt me at any time, whenever you're ready.
B
So, if we go to (where am I looking here?), if we go to the official etcd documents, we have a command which is etcdctl. This is how we interact with it: from a control plane node, you can use etcdctl to interact with etcd. And specifically, if we come down here to... come on.
B
If we come down here to the operations guide and we look for performance, there's going to be a set of commands that we can run against the cluster. The one that I'm looking for, that I'm not seeing at the moment... so there's the benchmark CLI tool, and there's also "etcdctl check perf --load=" with small, medium, large, or extra large. Effectively, what that's doing is: etcdctl is running a workload.
B
It's doing a bunch of queries and a bunch of other stuff against the cluster, and then it returns back some data, some values, from that. And that was where... where did I just, what did I do with that gist? I think one of the things that I was showcasing in there is the output of some of those etcdctl values.
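A sketch of that check as you might run it on an OpenShift control plane (the pod name is hypothetical; etcdctl inside the etcd pods already has endpoints and certificates wired up):

    oc rsh -n openshift-etcd etcd-master-0
    # note: check perf writes test data to the cluster, so prefer a quiet window
    etcdctl check perf --load=s     # s, m, l, or xl

It reports pass or fail against throughput and latency thresholds for the selected load profile.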
B
So essentially this is a good way to determine or identify (and let me go back over here to our KCS) what that request latency, or what that overall request time, happens to be, and to gauge whether or not your cluster is capable of meeting those requirements. So let me catch up on chat real quick here. "Fastest disk you've got to etcd": yes, Chris, I see that's you. "If you want to simulate disk events": thank you, Walid, for the link, I'll have to check that one out.
B
Oh, Anand, I see you're answering the question about how the pods communicate with the others: they use the host network.
B
Christian, yeah, I'm an equal opportunity offender. I try and switch back and forth between all of the different pronunciations of all of the different "control"s, "cuddle"s and "ctl"s, whether it's for Kubernetes, OpenShift, etcd, or Trident.
B
Trident, that's the NetApp CSI driver. So I think I'm caught up on chat. Please, just a reminder: anybody, feel free to submit questions at any time. Yeah.
C
Yeah, and Andrew, as you're looking at the questions, there are two links I would like to share with the group here. One, as I mentioned, is how you can use secondary storage for etcd. So this is a KCS article on how you can mount secondary storage, and I guess this is not just for etcd; any file system in OpenShift could be wired up to containers this way.
C
So this is that generic article, and specifically for etcd, and specifically for platforms like bare metal and vSphere, there is actually detailed instruction on how you can mount /var/lib/etcd (and, for that matter, just /var) on a separate file system. Let me paste a link to that as well. The key point I want to make here is, as you can see, these are day-two tasks, after the cluster has been created.
C
If you want to mount /var/lib/etcd to a separate secondary disk that is highly performant, here are the steps to do it as a part of day two. But we're looking at taking some of these steps and making them a part of the day-one operation, so you can do it right as the cluster is being installed, and you don't have to worry about doing it on day two. So that's something we're looking at from the product side.
B
Okay, so in the interest of time, I think I'm going to move ahead a little bit. We've beaten the performance side of the house, we've beaten that into shape.
B
If anybody has any questions, please let me know. We have a bunch more resources that I'm going to include in the show summary blog post; those go out Friday mornings on the openshift.com blog. I'll have just a huge amount of links there: literally, Chris and I were talking, like two and a half pages of notes leading into this session, and I'll include all of the links and stuff there. So, troubleshooting: how do we identify, how do we know, that there are issues happening with our cluster?
B
You'll start seeing some events. What's going on here, cluster? Oh, you've got to log back in; apparently it logged me out. There you go, my token expired. So you'll start seeing some events, and you'll start seeing some alerts happening inside of here. There will be all kinds of things that are basically saying something's really, really wrong, and usually it'll also ripple out into other things.
B
Yeah, there are other pods that'll all start complaining as well, so usually, when it becomes a problem, it's very obvious. So now, how do I actually determine how bad the problem is? Fortunately, in the performance dashboards here, we have one that's dedicated to etcd. Thank you. And this one has been here, I think, since day one, yeah.
B
Yeah, and we sometimes add and remove the charts and stuff like that that are here, and there's also a tremendous amount of additional information available in Prometheus that isn't in the dashboard. So if I jump back over here to this "recommended host practices" documentation page, you'll see that it suggests some additional values that you can check for, that you can monitor inside of, or via, Prometheus.
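For example, the disk-sync latencies he's about to call out can be queried directly in the Prometheus console; these are the standard etcd metric names:

    # p99 time etcd spends fsyncing its write-ahead log; sustained values near 10ms or more spell trouble
    histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
    # p99 time to commit to the backend database
    histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
    # leader elections (discussed below) are visible here too
    increase(etcd_server_leader_changes_seen_total[1d])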
B
So, arguably the most important single metric is this disk sync duration. Let me turn that off so we can make it a little bigger. This disk sync duration is literally the latency for writing values to disk for each one of the nodes, so pay special attention to that one. It's a good indicator if you're having issues with that storage latency problem.
B
If we look down here at the various peer traffic and client traffic, this can sometimes be an indicator of network issues: if you start seeing one or two nodes that are having a surge in traffic, or one node that's way below the others. One exception doesn't count: you can kind of always expect one node to have more peer traffic, because that's going to be the leader, and the leader is going to be handling additional traffic.
B
But if you see one that's really an outlier, something like that, then that can be an indication. And then, of course, there are the total leader elections per day. This one will be obvious. Remember when I showed the Raft image: the first thing it did was elect a leader, and then that leader is the one that handles all of the read/write operations for the API server.
B
So if it's having trouble, or if the other nodes are having trouble with it, they can basically say, "I don't trust the leader, I need to elect a new leader." When that happens, it basically pauses everything: it doesn't do any reads, it doesn't do any writes, it just stuns the whole cluster, and that will very much ripple out across the rest of OpenShift.
B
Oh, and that means the API server stops responding, all kinds of other stuff. So if you have something as simple as an "oc logs -f" running, oftentimes it'll cause the API servers to drop those connections and stuff like that. So you'll start to see those leader elections happening; that is a strong indicator that there is something bad happening inside of there. And then, of course, okay, we've identified something: we think that it's X or Y or Z, we see it in the events.
B
You can also, of course, check the pod logs. I showed how to do that a moment ago on the CLI: just run "oc logs" against the pods that are in the openshift-etcd namespace.
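Something like this (the pod name is illustrative; the pod runs several containers, so name the etcd one):

    oc logs -n openshift-etcd etcd-master-0 -c etcd --tail=50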
B
Okay, I said before that we were going to talk about some maintenance tasks. So, Anand, I think you probably have a better method of describing this than I do, and I'll catch up on chat while you're running through it. Can you tell us what compaction and defragmentation are within etcd?
B
Okay, so my two cents, or my layman's version, is that compaction removes key history. etcd is versioned: as I add new data, it basically creates new versions of those keys. So when I compact the database (we can see here in the metrics we have this DB size), when I compact the database...
B
...it just goes in and removes all of those old versions. But when it does that, it leaves holes in that capacity, in that space utilized. This is not necessarily a bad thing, particularly when we're talking about flash media; it's not like we have to wait for seek time or something like that. But at the upper end, when we're talking about a very busy cluster, or a very scaled-up cluster with lots of pods and lots of other things stored in etcd, this database can be pretty substantial.
B
...and the defrag takes a while, right? Just like we did with spinning media, it brings all of that data together so it's contiguous on the disk, and effectively, at that point, it frees, it returns to the system, all of that now-unused capacity. So it reduces the size of the database.
B
So this is why it can impact performance, lots of other things. A couple of important things here: if you look at the etcd documentation (what happened to the documentation... here we go), inside of the FAQ it'll make recommendations. And keep in mind...
B
...it is technically possible to do this manually inside of OpenShift. However, we don't recommend doing that, and the reason is because the system does it for you. So compaction (and Anand, please correct me if I'm wrong), I believe compaction happens every five minutes automatically, and defragmentation happens whenever the nodes are rebooted. The rationale being: compaction...
B
...is happening frequently, and it's a relatively low-I/O workload when it happens. Defrag, on the other hand, is a relatively high-I/O workload. So if we were to do that against a running node in the cluster, the expectation is it would impact disk performance, which would ripple out into other aspects of etcd performance. So we automatically do it when the node is rebooted, before it has rejoined the etcd cluster; that way, it's not impacting other things.
C
Right, Andrew. Just to summarize: compaction is deleting key history, and defrag is reclaiming the empty space and returning it to the OS. Compaction, to confirm, occurs every five minutes by default, and defrag needs to be initiated by the admin. So the etcdctl commands, Andrew, that you're showing have a flag for defrag: you could just run "etcdctl defrag", something that can be initiated by the cluster admin.
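If you ever do need to run it by hand (again, the platform normally handles this on reboot), the shape of it is roughly the following; the pod name is hypothetical, and you defragment one member at a time, since a member blocks reads and writes while the defrag runs:

    oc rsh -n openshift-etcd etcd-master-0
    etcdctl endpoint status -w table    # note the DB size before
    # scope the defrag to this one member
    etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
    etcdctl endpoint status -w table    # the DB size should shrink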
B
Yeah. So, generally speaking, unless something else has gone adverse, has gone sideways, we wouldn't expect those to need to be done manually, even in a very busy cluster, assuming (and I know this is a big assumption, as I have learned the hard way) you are rebooting the nodes regularly.

Yeah, so effectively we release z-streams every two weeks-ish, so we can reasonably expect most nodes to be rebooted at least once every two weeks. And of course, if you're configuring or changing other configuration with the MCO, that would result in a node reboot; that may be more frequent.
So the last thing that I wanted to talk about, just very quickly (and then, Anand, I want to hand over to you with at least five minutes to go to talk about roadmap), is backups. A lot of people talk about backups, and I think it's really important that we understand that, for etcd...
B
...backups are not necessarily great for disaster recovery. And you may be thinking, "that's literally why we create backups, Andrew, why are you telling me that I can't use that for disaster recovery?" Well, the backups are literally a snapshot of the database, of the etcd data space, and remember, all of those objects represent the literal configuration of Kubernetes, of OpenShift.
B
So if I were to suffer a catastrophic event at my site with my database, I can, at that same site, with that same OpenShift (the nodes are still up and running, just something happened to etcd), use that backup to restore it, and it'll come back relatively quickly. The support folks are super good at figuring that out and getting things up and running again.
B
But let's say that it's a disaster recovery scenario: something happened at the primary site, and it's not coming back, not for a good long while. I need to recover on the DR site, and that DR site is almost guaranteed to be different in some way. If you recover that backed-up etcd database, it doesn't know that; it's going to come back up and it's going to think that everything is exactly the same as it was. So, my personal opinion (and I can't see Christian, but I'm pretty sure he's going to be nodding...
B
...his approval) is to take kind of a GitOps approach to DR: don't literally move everything, instead reinstantiate it on the destination side. I know that's probably a little bit controversial; please feel free to beat me up in the comments, or publicly on social media, et cetera. I'm happy to defend my position on that, but sometimes we have to think a little bit differently about how things are done in Kubernetes.
C
Sure. I just want to complete your conversation on backup, Andrew, if you don't mind, and then I'll wrap it up. So on backup, I just want to show a couple of ways you can back up the etcd cluster. Here is some good documentation around it; this is for 4.7, and I'm going to paste this link here. But essentially, you do an "oc debug", let's go back here...
C
...and then you would navigate to the host directory, and then you can run this command: /usr/local/bin/cluster-backup.sh, and then you specify where you want the backup to go, in this case /home/core/assets/backup. Once you do that, it takes a few minutes; it backs up the etcd database, and it also backs up all the etcd static pod resources.
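The full documented sequence looks like this (the master node name is illustrative):

    oc debug node/master-0
    # inside the debug shell:
    chroot /host
    /usr/local/bin/cluster-backup.sh /home/core/assets/backup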
C
So if we now go to that directory, /home/core/assets/backup, you will see at least two resources (these are two backups here, so you see four files, but the point is you should see two resources per backup). One is the backup of the database itself, and the other is the backup of the static pod resources, which also has the encryption key. So make sure that if you're backing these two things up, you don't leave them in the same place; it's kind of like locking your house and leaving the key right outside the doorstep.
C
So that's a basic example of a backup. Let me see if I missed anything... yeah, the other thing to note is that we encrypt only the values and not the keys. So resource types, namespaces and other object names will not be encrypted; make sure you don't put confidential information in namespaces and other resource names. So that is around backing up etcd.
C
There is, again, a procedure for restoring from the backup. I'm not going to go through the sequence here, but I will paste the instructions, and you can...
C
...easily restore from the backup. And let's see what else... yeah, last but not least: encrypting etcd. As you know, a lot of the important state information on the cluster is stored in etcd, and if my browser cooperates here, I'll show you how you can actually encrypt etcd.
C
Let me actually... so one of the things you can do is, you can see here that...
C
...all of them are encrypted. If you want to turn this off, you'll just change the type to identity; and if you want to turn it back on, you'll just specify the encryption type again. Once you specify the encryption type, you save this file out, and it takes a few minutes for the API server to recycle and apply encryption.
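A sketch of that same edit done with oc patch instead of the editor (aescbc is the supported encryption type here, and the status query follows the documented pattern):

    # turn etcd encryption on
    oc patch apiserver cluster --type merge -p '{"spec":{"encryption":{"type":"aescbc"}}}'
    # watch the Encrypted condition until it reports EncryptionCompleted
    oc get openshiftapiserver -o jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{": "}{.message}{"\n"}{end}'
    # and turn it back off
    oc patch apiserver cluster --type merge -p '{"spec":{"encryption":{"type":"identity"}}}'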
C
As you can again see, encryption is completed for the access tokens and the authorization tokens, so pretty much across the board, all of the resources have been encrypted. And again, like I said, if you wanted to go back and unencrypt etcd, you just flip the encryption type back to identity, wait a few minutes, and the encryption should be turned off.
C
That's pretty much it. So we now know how to do a backup, how to do a restore, how to encrypt and unencrypt etcd, and all the resources that are encrypted as a part of the process. And I want to wrap (I think we have two minutes), I want to wrap with the roadmap: what's coming up next. So etcd has been bumped to 3.4.9, is what the slide says, but I believe the latest update is 3.4.14.
C
We want to improve disaster recovery and backup. You saw me doing manual backups and manual restores, but where we want to get to in the future is a more automated way of doing backups and restores. So what we want to do is provide a config file, like a YAML file, where you can specify how often you want to take these backups, where you want to store these backups, and how you want to name these backups.
C
For instance, if you want to take periodic backups every 24 hours and then send those backups to an S3 bucket in AWS, we want to provide a config file where you can specify the automation around that. So we're doing a lot of work around improving the backup and disaster recovery scenario: taking snapshots, restoring from snapshots, storing those snapshots, specifying how recurring those snapshots should be, and things like that.
C
So that's one of the major things we're working on. The other thing we're working on, like I said, is moving secondary disks for etcd into the day-one install, so you don't have to worry about setting it up as a day-two operation, and with that you can get access to faster disks as you're setting up the cluster.
C
The other key thing I want to point out is we are looking at scaling up and scaling down the etcd operator. As you know, the default install of OpenShift comes with three masters, but let's say you want to scale down to a single-node OpenShift, maybe for CodeReady Containers, or maybe for edge deployments.
C
For whatever reason you want a single-node deployment of OpenShift, we want to make sure that etcd supports those single-node deployments. So those are some of the things we're looking at in the near future that we're actively working on. And we're also always improving performance, always trying to gather more metrics from Prometheus, and always trying to make sure that the cluster is reliable.
B
We're up against the OpenShift Commons session that is coming up, so thank you very much, Anand; we greatly appreciate you coming on and sharing with us today. Thank you very much to our audience as well; we greatly appreciate you sticking with us and asking all of the phenomenal questions. If you have additional questions, please don't hesitate to reach out to either Chris or me. I am on Twitter at @practicalAndrew, just like my username here and on Twitch, and Chris is @ChrisShort on Twitter.