From YouTube: Ceph Month 2021: 5 more ways to break your ceph cluster
Description
Presented by: Wout van Heeswijk
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
A
In one case we had a script that was run in front of the automation, so automating the automation tool, and through all of that some checks were missing. It didn't check the number of monitors that was set, so it just didn't fill out that variable, and that empty value overruled the existing configuration when applied. It said everything is okay, yes to all, and started removing the monitors. And this tool is also pretty thorough, so it also removed all the monitor databases and everything; there was no trace of them anywhere.
A
Another one: size=2 was in the previous talk, but we've kind of revised that to min_size=1. We still recommend that, in any case, you always run with size=3 if you value your data, or do something with erasure coding. We'd rather not see size=2, and if we do see size=2, then please, please don't go with min_size=1.
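A minimal sketch of how you might check and enforce this from the Ceph CLI; the pool name mypool is just a placeholder:

    # Check the current replication settings for a pool.
    ceph osd pool get mypool size
    ceph osd pool get mypool min_size

    # Recommended for replicated pools if you value your data:
    # keep three copies, and only accept writes while at least two are up.
    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2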
A
We also found an interesting case, a couple of interesting cases, where people didn't complete the upgrade. In the Ceph documentation there's an upgrade guide, but it's not specific to a release; it's a generic upgrade guide. This matters especially with the upgrades to Nautilus and messenger version 2.
A
Yes, we saw that there were some discrepancies, and some parts of the upgrade were never executed. Miraculously the cluster will survive for some time, but then it will break, and it will break badly, and it's very hard to troubleshoot if you don't know what you're looking for. What you're looking for is something that looks like this, continuously, for a long time. The solution is to finish the upgrade: finish all the steps for updating to Nautilus or Octopus. I've seen a cluster that survived like that into Octopus.
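For reference, the finalizing steps from the Nautilus upgrade guide that are easiest to skip look roughly like this; a sketch, so check the release-specific notes for your versions:

    # Disallow pre-Nautilus OSDs and enable Nautilus-only functionality.
    ceph osd require-osd-release nautilus

    # Enable the msgr2 protocol; the mons start listening on port 3300.
    ceph mon enable-msgr2

    # Verify that all mons now advertise v2 addresses.
    ceph mon dump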
A
This was one that took us a while to figure out. We had a customer that was having problems with their RGW environment, and after a while we looked at the IP addresses in the service map for the RGW service: it only showed three RGWs, and the customer said, I have nine RGWs. And we saw the IP addresses in the map updating all the time.
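To see what is being described here, you can dump the service map yourself; the jq path below is an assumption about the JSON layout:

    # List the RGW daemons the cluster currently knows about.
    ceph service dump -f json | jq '.services.rgw.daemons'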
A
We have one bonus one, which is blindly trusting the PG autoscaler. I think a lot of this has to do with Ceph becoming much easier to use: we see many more installs, and you don't have to understand or know as much as you used to. But we've seen that clusters are installed with all the defaults, while they're reasonably large, and then they start ingesting data.
A
And that's when the PG autoscaler comes in and says, right, now we need some more placement groups here, and it starts splitting placement groups. Then you ingest more data, and it says, oh, now you need some more placement groups. And this was a cluster that was almost a petabyte.
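A sketch of how you can avoid that treadmill by telling the autoscaler up front how big a pool will get, instead of letting it chase the data; the pool name and ratio are placeholders:

    # See what the autoscaler currently thinks of each pool.
    ceph osd pool autoscale-status

    # Hint that this pool will eventually hold roughly 80% of the cluster,
    # so PGs are split once, up front, not repeatedly during ingest.
    ceph osd pool set mypool target_size_ratio 0.8

    # Or take manual control of a pool's PG count entirely.
    ceph osd pool set mypool pg_autoscale_mode off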
A
So we also looked at the original ten that Wido did. I urge you all, if you haven't seen it, to also look at the original "10 ways to break your Ceph cluster". But from the original 10 there's number six, which we no longer consider a way to break your Ceph cluster, because it's been superseded by BlueStore; we don't use XFS that much anymore.
A
There are also talks of removing FileStore entirely from Quincy, I believe. So this is the one that is fixed by BlueStore. I think there is work on the way to also improve the autoscaler, which is also great, but most of them are still ways to break your Ceph cluster.
B
We have a very quiet audience today. Either that, or people are having issues seeing the slides.
C
I have a question: is Dan here? Hey Dan. Yeah, hi. So, we've luckily never had to restore mons from the OSDs; we've never lost all our mons. I'm just curious: I know there is a documented procedure somewhere in the Ceph docs, yeah, like this one. Is that what you followed, and does that still work? Yeah, that works, okay.
A
It takes a little time, but it works. It copies the data of the database over from the OSDs and adds it to the database. We were very lucky: this was a very young cluster, so there was not too much data to scrape. But I think if your cluster grows and you use it much longer, with many more changes, then that process will also take much longer.
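The documented procedure referred to here is the "recovery using OSDs" section of the Ceph troubleshooting docs; a rough sketch of its core, with all paths as placeholders:

    # With the OSDs stopped, scrape each OSD's copy of the cluster maps
    # into a fresh mon store.
    ms=/tmp/mon-store
    mkdir -p "$ms"
    for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path "$osd" \
            --op update-mon-db --mon-store-path "$ms"
    done

    # Rebuild the monitor store from the scraped maps; the keyring must
    # hold the mon. and client.admin keys.
    ceph-monstore-tool "$ms" rebuild -- --keyring /path/to/admin.keyring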
C
(Question about whether you can back up the monitor databases.)

A
No, no, yeah, you can back them up, but the data is always far enough behind that it doesn't make sense. We've also seen, and I could have included that one too, that somebody had a virtual install, with some data that was interesting to them, on a virtual Ceph install. Somebody was updating, the updates didn't go very well, and he just rolled back the virtual machines, all to different states.
A
Yeah, you can also encounter that. That was the sound I made as well. So yeah, there's no real backup strategy there. What we try to do is just monitor your mons very well, make sure that you have enough of them, preferably on maybe two different hardware vendors, so that you are unlikely to encounter the same problem on all of the monitors. But yeah.
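A minimal sketch of the kind of mon monitoring that suggests, suitable for wiring into whatever alerting you already run:

    # Which mons exist and which are in quorum.
    ceph mon stat

    # Machine-readable quorum membership, for an alerting check.
    ceph quorum_status -f json

    # Surfaces MON_DOWN and related warnings.
    ceph health detail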