From YouTube: Ceph Science Working Group 2022-05-24
Description
Join us for Ceph Science Working Group meetings. We alternate between the third and last Tuesday of each month at 14:00 UTC: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A
Yeah, at least on our clusters we don't change the default tunables for scrubs and deep scrubs. By default it's supposed to be doing things in the background; however, if you have billions and billions of small objects, then maybe there are cases where RocksDB can become really slow reading through and iterating through the objects.
A
That's why I was asking if that's related. Did you maybe write a bunch of objects, then delete a bunch of objects, then write more objects, then delete again? Is it possible there's quite a lot of deleted objects in the...
A
You'll get those warnings if Ceph cannot scrub all of the objects every day and cannot deep scrub all of the objects every week. Those defaults obviously don't make sense for every case, for example a full disk: if it's a 16-terabyte disk, it's simply impossible to deep scrub all of the objects once a week. So there's a tunable to wait longer, to only complain after a longer amount of time. Have you seen these options?
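For reference, the warning threshold being referred to can be reasoned about roughly as follows. The option names osd_deep_scrub_interval and mon_warn_pg_not_deep_scrubbed_ratio are real Ceph options; the exact formula behind the health warning is assumed here and should be checked against the documentation for your release.

```python
# Rough sketch: when would PG_NOT_DEEP_SCRUBBED appear, assuming the monitor
# warns once a PG's last deep scrub is older than
# osd_deep_scrub_interval * (1 + mon_warn_pg_not_deep_scrubbed_ratio)?
osd_deep_scrub_interval = 7 * 24 * 3600          # one week, the default
mon_warn_pg_not_deep_scrubbed_ratio = 0.75       # assumed default warn ratio

warn_after_days = osd_deep_scrub_interval * (1 + mon_warn_pg_not_deep_scrubbed_ratio) / 86400
print(f"warning after ~{warn_after_days:.1f} days without a deep scrub")
# Raising the interval (e.g. to two or three weeks) or the warn ratio pushes
# this deadline out, which is the "wait longer to complain" tunable above.
```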
B
Yeah, we had this email conversation with Pieter. I think Pieter sent it to you.
D
Okay, so one thing to watch out for: if you're using any RAID controllers, they might be doing something similar. I know Dell calls it "patrol reads", where the controller does a consistency check on the whole disk. That'll slow down I/O on that whole disk, or scrubs, or anything else, for quite a while on large disks.
A
Yeah, I'm thinking about creating a tracker for this, so that we do the math: to deep scrub every object on a disk in one week, what would be the megabytes per second needed to do that? I guess you also have to multiply it by the replication count to account for the total amount of... I don't know. I think we should.
A
We can justify it through the data rate, to see how long it would take to deep scrub, say, a 20-terabyte disk, which is becoming the norm now. I would say 20 terabytes is what people will mostly have this year or next year, and then we can set the default accordingly. Maybe the default should be two weeks instead of one week, or three weeks, but then, yeah.
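As a back-of-the-envelope version of that calculation (the disk size and interval below are illustrative assumptions, not figures agreed on in the discussion): a 20 TB disk deep-scrubbed once per week needs roughly 33 MB/s of sustained reads just for scrubbing, before any client I/O.

```python
# Required sustained read rate to deep scrub every object on a disk once per interval.
disk_bytes = 20e12            # assumed 20 TB disk
interval_days = 7             # deep scrub everything once per week

mb_per_s = disk_bytes / (interval_days * 86400) / 1e6
print(f"{mb_per_s:.1f} MB/s per disk for a {interval_days}-day interval")   # ~33 MB/s
print(f"{mb_per_s / 2:.1f} MB/s if the interval is doubled to two weeks")
```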
A
Because a long time ago we had a cluster with six-terabyte drives, but the workload was such that deep scrubs couldn't be guaranteed more often than once per two months. You would never have calculated that from the six-terabyte drives alone; the drives were so busy with client I/O that they didn't have the IOPS left for scrubbing.
A
We can open a ticket and see with the RADOS team if they have a better suggestion.
A
So if each OSD can only do one scrub at a time, then a three-replica pool can be scrubbed twice as fast as a four-plus-two erasure-coded pool, because the four-plus-two pool is locking six OSDs while the replica-three pool locks three OSDs. So it's also about the PG size, not the count; not pg_num.
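A quick illustration of that point, under the assumption stated above that each OSD participates in only one scrub at a time (the OSD count below is an arbitrary example):

```python
# With one scrub per OSD at a time, concurrent PG scrubs scale roughly with
# num_osds / pg_width, so wider EC pools scrub proportionally slower.
num_osds = 600   # arbitrary example cluster size

for pool, pg_width in [("replica 3", 3), ("EC 4+2", 6)]:
    print(f"{pool}: at most ~{num_osds // pg_width} PGs scrubbing concurrently "
          f"({pg_width} OSDs locked per scrub)")
```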
D
Yeah, see what she thinks about it.
A
Okay, but I mean, Yoni, what I was saying is that in practice you just increase the ratios; that's what most people are doing now. I would say if you get the warning, just increase the ratio, the ratio for scrub and deep scrub, until you stop getting the warning. Voila.
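In practice that usually means bumping the scrub-related options, for example with ceph config set; the option names below exist in recent releases, but the values are only examples and worth checking against your own deep scrub throughput.

```python
# Sketch of pushing the deep scrub deadline out cluster-wide (example values only).
import subprocess

two_weeks = str(14 * 24 * 3600)
for opt, val in [
    ("osd_deep_scrub_interval", two_weeks),           # allow 14 days per deep scrub cycle
    ("mon_warn_pg_not_deep_scrubbed_ratio", "0.75"),  # warn only well past that interval
]:
    subprocess.run(["ceph", "config", "set", "global", opt, val], check=True)
```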
E
Maybe write a script to find out from pg dump which PGs are not scrubbed. We can get the timestamp from the pg dump, when they were last scrubbed, and run a deep scrub manually.
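A minimal sketch of such a script, assuming the JSON layout of ceph pg dump and the pg-stats field names (pgid, last_deep_scrub_stamp) used by recent releases; the timestamp format differs slightly between versions, so both common variants are tried.

```python
#!/usr/bin/env python3
# Sketch: list PGs whose last deep scrub is older than a threshold and
# optionally trigger a manual deep scrub. Field names and JSON layout are
# assumptions based on recent releases; verify against your cluster first.
import json
import subprocess
from datetime import datetime, timedelta, timezone

THRESHOLD = timedelta(days=14)   # consider a PG stale after two weeks
DRY_RUN = True                   # set to False to actually issue deep-scrub commands

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]))
pg_stats = dump.get("pg_map", dump).get("pg_stats", [])

now = datetime.now(timezone.utc)
for pg in pg_stats:
    stamp = pg["last_deep_scrub_stamp"]
    last = None
    # Timestamp formats differ between releases; try the common variants.
    for fmt in ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%d %H:%M:%S.%f"):
        try:
            last = datetime.strptime(stamp, fmt)
            break
        except ValueError:
            continue
    if last is None:
        continue
    if last.tzinfo is None:
        last = last.replace(tzinfo=timezone.utc)
    if now - last > THRESHOLD:
        print(f"{pg['pgid']}: last deep scrub {stamp}")
        if not DRY_RUN:
            subprocess.run(["ceph", "pg", "deep-scrub", pg["pgid"]], check=False)
```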
D
Other topics that just got tossed in the pad: you know, there's some frequency hit. I don't know if anybody's even tried that one yet. For releases, I want to talk about Octopus and then the end of life there.
A
Yeah, there's a new Octopus release that will be coming out. There'll be one more Octopus release.
B
Well, the biggest reason, I think, is that we don't have time to do it, as we are bringing bigger and bigger clusters into production. The peak cluster is coming in Pacific; I think we will run it in Pacific. We haven't tested Quincy yet, but after that is in production, I hope that we can upgrade from Nautilus to Pacific. I think we would be skipping Octopus.
A
If you have old clusters, even the mon DB won't load, I guess.
A
Yeah, my memory of the conversion is vague. I know that we changed to RocksDB, but I thought that we did it by deleting the mon DB directory and redeploying the mons, like resyncing them. But maybe if people had ceph-deploy or cephadm, it happened automatically.
D
I think what I'll end up doing is to put those files that are too big (we're using librados, if you recall) onto either the S3 gateway or CephFS, and then keep all the other data sets as they are, continuing with librados.
D
Yeah, I'll be there. You think you're making it?
A
We use Elasticsearch for the S3 logs, but not for Ceph itself right now.
E
We were researching between Loki with Promtail, Elasticsearch, and even Graylog, because inside Graylog it's implemented natively by default. But Graylog has some issues, like you have to statically configure everything. So out of the whole decision, Loki and Promtail were chosen, since Grafana is used already.
D
I've been thinking about streaming the Ceph logs and everything else into Loki too. The nice thing is, you can just point it at the RADOS Gateway S3 back end for its storage for all the logs.
A
So this Loki: I'm looking at that Loki PR. Did that make it into Quincy?
E
Yeah, from a user standpoint, I thought that since the clusters can be as huge as possible, right, we could have logs in one place. We can do pattern-based searching; it would be a one-stop search for the logs, and troubleshooting could be much easier from a user standpoint.
D
It seems like that's the way a lot of the features and stuff have been going lately: easier usability of everything, the dashboard, and Loki and logs as well.
C
Which was perfectly fine with the kernel client for CephFS until recently, when the performance has really turned abysmal, and I'm a bit perplexed, because other clients are fine. So I think the service itself is basically okay, and I see lots of messages about sockets; I put one in the chat.
A
Fixes like that, could that be causing this?
C
I don't think there's been any Ceph-related upgrade on the repos that are available for it in quite a while. So why it's suddenly changed is perplexing.
A
And it didn't make any difference? Check, maybe, can you check for TCP errors, TCP retransmits? Maybe there's something wrong on the network interface; maybe it's not really that at all. Maybe it's... I've seen this in the past.
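One simple way to do that check on a Linux client or OSD host is to look at the kernel's TCP counters, for instance via netstat -s or directly from /proc/net/snmp as sketched below; a persistently climbing retransmit ratio points at the network rather than Ceph.

```python
# Sketch: report the TCP retransmission rate from /proc/net/snmp (Linux only).
# A persistently high RetransSegs/OutSegs ratio suggests a network problem
# rather than a Ceph or CephFS problem.
def tcp_retransmits(path="/proc/net/snmp"):
    with open(path) as f:
        tcp = [line.split()[1:] for line in f if line.startswith("Tcp:")]
    stats = dict(zip(tcp[0], map(int, tcp[1])))   # header row, then value row
    return stats["RetransSegs"], stats["OutSegs"]

retrans, out = tcp_retransmits()
print(f"TCP retransmits: {retrans} of {out} segments "
      f"({100.0 * retrans / max(out, 1):.3f}%)")
```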
D
I've seen something in one of our custom apps where we were hitting ulimits for open files and sockets, but you would probably see an error regarding that if that were the case here.
D
All right, anything else we want to talk about, or...?
D
Yeah, I had a couple of people say it would work for them, but it could just be bad luck this month that not too many people made it.
D
Oh sure, if you want, but I sent it to ceph-users, and then I also sent it to a private mailing list of people who have joined. So as long as your email is in the sign-in section of that pad, I'll add you to the private list so that you'll get a separate email. That's just a group of us who've been to these before, usually.