From YouTube: Ceph Science Working Group 2020-09-23
A: So, my quick little 30-second spiel: we're just a group of research sysadmins, or big-cluster sysadmins, who use Ceph, getting together every once in a while to talk about anything related to it. Feel free to contribute to the conversation; I'm not presenting, just trying to keep things moving along. So I guess to start us out, we had a few general topics. Does anybody have recent outages they want to own up to, whether their own fault or something else's?
A: I've got one: we had a total power outage for four hours, and our UPS only lasts about 20 minutes. We don't have generator backups, so the entire data center went down. Overall, two out of my three clusters came up without problems. The third one, the bigger one at almost 10 petabytes, had issues: two hosts didn't come up right away and needed a little kick during the boot process once I got access to it all again. The other stuff was just corrupt journals here and there, so I was rebuilding probably 10 journals or so; they seemed to be located mostly on just one or two hosts. Overall a big outage, but first off not my fault, which is always nice, and no lost data.
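(A minimal sketch of the journal-rebuild step described above, assuming systemd-managed FileStore OSDs; the helper and the choice to simply recreate an unreadable journal are illustrative, not the speaker's exact procedure.)

    import subprocess

    def rebuild_filestore_journal(osd_id: int) -> None:
        # Stop the OSD before touching its journal.
        subprocess.run(["systemctl", "stop", f"ceph-osd@{osd_id}"], check=True)
        # A corrupt journal usually cannot be flushed; tolerate the failure
        # and recreate the journal in place instead.
        flush = subprocess.run(["ceph-osd", "-i", str(osd_id), "--flush-journal"])
        if flush.returncode != 0:
            subprocess.run(["ceph-osd", "-i", str(osd_id), "--mkjournal"], check=True)
        subprocess.run(["systemctl", "start", f"ceph-osd@{osd_id}"], check=True)

    rebuild_filestore_journal(42)  # example OSD id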
A: It took a little while to figure everything out after everything came up, and to do a little rebalancing, but it went pretty well overall. Does anybody else have any outages, or any kind of bugs and stuff people have hit?
C: At least bug-wise, I added something to the list there in the pad, if people are running 14.2.11, actually, or anything recent on Octopus as well.
A: One thing I noticed, which I haven't reported, and I don't really know yet if it's a bug or just something in my environment: my FileStore OSDs seem to not be cleaning themselves up after PGs migrate around, so I get artificially high disk usage on some of them. I've looked into it a bit. I haven't found anything in the tracker that seems relevant to it yet, so I might end up having to get some good debug logs and post something about this one. In the end I tend to just give up on it, because most of these are old OSDs that were built with ceph-disk, so I end up just rebuilding them with ceph-volume and going on with my life. It's one or two here or there, so it's just a bit of an annoyance. But I've got an upgrade to 14.2.11 coming up soon; I'll see if it still happens on that, and if it does, I was going to report it.
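(A hedged sketch of how one might look for that leftover PG data, assuming a default FileStore path and Nautilus-style JSON from ceph pg ls-by-osd; the helper is hypothetical, and anything it reports should be verified by hand before deleting.)

    import json
    import os
    import subprocess

    def stray_pg_dirs(osd_id: int) -> list:
        # PGs the cluster currently maps to this OSD.
        out = subprocess.run(
            ["ceph", "pg", "ls-by-osd", f"osd.{osd_id}", "-f", "json"],
            check=True, capture_output=True, text=True).stdout
        mapped = {p["pgid"] for p in json.loads(out)["pg_stats"]}
        # PG directories actually present in the FileStore data dir.
        current = f"/var/lib/ceph/osd/ceph-{osd_id}/current"
        on_disk = {d[: -len("_head")] for d in os.listdir(current)
                   if d.endswith("_head")}
        # Directories with no matching mapped PG are cleanup candidates.
        return sorted(on_disk - mapped)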
A: All right, the next thing on the list was just, in general: how's life with Octopus? Has anybody actually moved large clusters? Positive or negative experiences doing it?
B: Not really Octopus, but, how to describe it: an issue with the kernel client and capabilities. When you're adding new pools to the file system, for example, the changes are not propagated to the current clients. We'll probably try to investigate this offline. On some clients it's picked up properly; on some others you have to remount the file system, which is kind of not very good, let's say.
A: Which kernel is that?

B: It's 5.7, or thereabouts.
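(A sketch of the workaround being described, assuming caps managed with ceph auth; the client name is a placeholder. The "tag cephfs" OSD cap covers all current and future data pools of the file system, which avoids per-pool cap edits.)

    import subprocess

    def refresh_cephfs_caps(client: str, fs_name: str) -> None:
        # Re-issue caps using the pool application tag so they apply to
        # every data pool of the file system, including newly added ones.
        subprocess.run([
            "ceph", "auth", "caps", f"client.{client}",
            "mon", "allow r",
            "mds", "allow rw",
            "osd", f"allow rw tag cephfs data={fs_name}",
        ], check=True)

    # Affected kernel clients may still need a remount to pick this up:
    #   umount /mnt/cephfs && mount -t ceph :/ /mnt/cephfs -o name=<client>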
C: Nautilus and the gateway? Oh yeah, I mean, just because tomorrow we're going to upgrade s3.cern.ch from Luminous to Nautilus. We've tested everything and it should be okay. We already run our second S3 cluster on Nautilus, so I know that it's fine, but I'm always worried about the RADOS Gateway, all of the region settings. I'm always worried that the region settings from one version don't work in the next version, but it's hard to test that.
D: We should count that as a good experience. I mean, we don't have massively heavy usage of our S3 endpoint, and I'm not aware of anybody making use of the region features or doing anything complicated with it. But there are certainly a lot of internal use cases where people are dumping, you know, log files and things, and then we have a growing number of genuine users that are trying to sync data across from their experiments, normally on campus, with various random things that they download off the internet because it works with Amazon. We ask them what they're doing, and it's "oh, I saw this and I downloaded it and it works fine," and you're like, okay. Well, there are several of them, and none of them had problems when we did it. So it's not exhaustive, but okay, thanks; sorry for the interruption.
D: While we're on the topic of, I guess, S3 upgrades: how well integrated is it with your OpenStack cluster? If you've got users on OpenStack, are there various things that they're using to basically automatically access S3 or Swift?
D: I guess the integration with Swift is better; that's certainly a hot topic for us at the moment, and people are actively adding the features to the cloud, because there's demand from the users. They've got projects within OpenStack and they want to be able to create containers automatically that would be in the S3 cluster. I don't know whether that... have I jumped ahead of the schedule with that particular question?
C: So the problem in the past was... okay, so yeah, we want the same thing. We have OpenStack, and we want users to be able to click on the object store button in OpenStack and manage their EC2 credentials, and also to be able to use Swift. Although they don't really want to use Swift; they want to use S3. So they want to be able to create EC2 credentials and delete them and manage their quotas, and things like that.
C: There's an option in RADOS Gateway to let Keystone do the S3 authentication, but the problem is that this adds an authentication hop to the RADOS Gateway for every single IO. So it slows things down a lot, and it hammers your Keystone.
C: So what we did was: Jose wrote some machinery that basically synchronizes all of the EC2 credentials from Keystone into our RADOS Gateway. It writes them as local users with local S3 keys. So then we do the authentication with the local plugin in the RADOS Gateway, and that makes it fast. But my understanding is that there's a proper way to do this now; I don't know how that works. I think there's a better way to do that kind of synchronization.
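(A toy sketch of that synchronization idea, not the actual tooling: each Keystone EC2 credential is written into the gateway as a local user with a matching S3 key, so the local auth plugin can verify requests without calling Keystone. The function and its inputs are placeholders.)

    import subprocess

    def mirror_ec2_credential(uid: str, access_key: str, secret_key: str) -> None:
        # Ensure a local RGW user exists for this Keystone user/project.
        subprocess.run(
            ["radosgw-admin", "user", "create",
             "--uid", uid, "--display-name", uid],
            capture_output=True)  # tolerate "user already exists"
        # Mirror the EC2 credential as a local S3 key on that user.
        subprocess.run(
            ["radosgw-admin", "key", "create",
             "--uid", uid, "--key-type", "s3",
             "--access-key", access_key, "--secret-key", secret_key],
            check=True)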
D: Yeah, I was just kind of looking through the various clusters that people have. I guess most people are using RBD for OpenStack, but I thought I saw somebody else has an S3 cluster as well for that. Oh yeah, S3 and RBD storage: that was Matthew Vernon. I don't know if he's on; he's not attached to this meeting, but he did put his information in the Ceph pad, or sorry, the pad.ceph.com thing.
D: Okay, well, we'll keep looking at that. Thank you though, Dan; that is useful, because we would probably have run into that problem, and now we can at least ask you what you've done, or for the details. So we can stop running into the S3 thing, because we're in the same situation: everybody wants to use S3 because that's what they've heard of, and even when you explain that there's better integration with Swift and it's approximately the same in terms of what it does logically, they're like...
C: Right, yeah, we have a lot of S3 users like that. I mean, our GitLab has like 70 or 80 terabytes on our S3 cluster, in two buckets, like 50 million objects.
D: And just on large numbers of files in buckets, because we've had that a couple of times: somebody's done something silly and they've written millions and millions of files into a bucket, and when they get really big they seem to be really difficult to manage. We run into them fairly often.
D: Unfortunately, most of the time when that happens it's because somebody's made a mistake, and you can often just delete the bucket. But I kind of worry that there might be a case where there's a lot of useful information in there and then somebody also writes something stupid. The bucket seems to start performing erratically, and then it takes a real effort to actually clean up.
D: Because you're having to do lists and things to find the objects that shouldn't be in there and to delete them. I don't know if anybody's had similar experiences with that, or, I mean, is that possibly something to do with the design of the cluster? Whether we need to put the shards on faster disks and things for it to be more performant.
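(For illustration, a cleanup sketch with boto3 that finds keys under a bad prefix and deletes them in batches of 1000, the DeleteObjects maximum; the endpoint, bucket, and prefix are placeholders, and a dry run is advisable before deleting anything.)

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
    bucket, prefix = "shared-bucket", "mistake/"

    # Page through the listing and delete unwanted keys 1000 at a time.
    paginator = s3.get_paginator("list_objects_v2")
    batch = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            batch.append({"Key": obj["Key"]})
            if len(batch) == 1000:
                s3.delete_objects(Bucket=bucket, Delete={"Objects": batch})
                batch = []
    if batch:
        s3.delete_objects(Bucket=bucket, Delete={"Objects": batch})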
C: So, we have automatic resharding disabled, and one month ago we decided to reshard the GitLab registry bucket, which had 32 million objects in it. The users use rclone to synchronize that bucket onto another S3 just as a backup, and they found that after we resharded from 32 to 512 index shards, their backup process doesn't complete anymore.
C: They have a one-hour time limit, and they benchmarked the list operations: the lists are like ten times slower after we resharded. So I was reading through the Nautilus history, and I found that someone implemented some kind of optimization for highly sharded bucket indexes like that. Because the problem is that when you have so many shards and you do a list operation, the RADOS Gateway asks all of the shards for their first 1000 entries, because the list operation is paginated, 1000 entries at a time. So if you have hundreds of shards, you end up with hundreds of thousands of entries to sift through and then sort. But in Nautilus there's some kind of optimization for this, so that it doesn't send so much traffic around the network. I haven't tested it yet, but I'm hoping that it's better.
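(A small benchmark sketch of the listing pattern under discussion: each 1000-entry page below is one paginated request, and on a heavily sharded index the gateway may have to gather and sort candidates from every shard to answer it. The endpoint and bucket names are placeholders.)

    import time

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example.org")

    start, count = time.time(), 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="gitlab-registry",
                                   PaginationConfig={"PageSize": 1000}):
        count += len(page.get("Contents", []))
    print(f"listed {count} objects in {time.time() - start:.1f}s")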
C: So is that also automatic, or is there a flag that we need to set?
D: I'll have to check with Tom. It's interesting that we're not the only people to experience that, so it's not just related to the hardware or whatever, and there may well even be a fix for it, or an improvement at least.
A: So I was wondering how well the balancer is working for people, in upmap mode.
B: The same goes for the new feature, the resizing of placement groups (the PG autoscaler): it was also too aggressive. It's nice if you have small pools, so that you don't allocate too many placement groups at once, but when you have large pools it's typically better to do it manually and disable the autoscaler.
C: I just put it in there, yeah. I think the key setting is upmap_max_deviation, because the default is plus or minus five: you can still have OSDs that are 10 PGs away from each other, and the balancer says it's optimized. That's the default setting, so check that you have that one. Look back and see; you can do a ceph config dump to see what you have.
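(Following that advice, a quick sketch for checking and, if desired, tightening the setting; the value 1 is an example, and the option path applies to the Nautilus-era mgr balancer module.)

    import subprocess

    # Show the current value (the default in this era was +/- 5 PGs)...
    subprocess.run(["ceph", "config", "get", "mgr",
                    "mgr/balancer/upmap_max_deviation"], check=True)
    # ...and tighten it so OSDs are kept within 1 PG of the mean.
    subprocess.run(["ceph", "config", "set", "mgr",
                    "mgr/balancer/upmap_max_deviation", "1"], check=True)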
A: So, one other thing I had a question about: is anybody using ceph-csi? Not necessarily in relation to, you know, big clusters, but in general, with Kubernetes, integrating stuff with it?
C: They use it, and the only thing is that it was tricky: we had to tell them to set different options, like fuse options for the Ceph clients, for CephFS. It was tricky for them to set special options like fuse_big_writes, and the auto-reconnection settings that we like to customize.
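(A sketch of pushing those client options through the central config store instead of per-pod ceph.conf files; the option names are the ones mentioned, but whether a given ceph-csi version's fuse mounter honors them is an assumption to verify.)

    import subprocess

    # Larger fuse writes plus automatic reconnection of stale client
    # sessions, as discussed above.
    for opt, val in [("fuse_big_writes", "true"),
                     ("client_reconnect_stale", "true")]:
        subprocess.run(["ceph", "config", "set", "client", opt, val],
                       check=True)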
A: All right, if there's nothing else, thanks for joining. A little smaller group today, but I saw, and I remember, a few emails about people being out. Makes sense at the end of the summer; might as well take advantage of it.