Description
Led by: Kevin Hrpcek
Ceph Month 2021 schedule: https://pad.ceph.com/p/ceph-month-june-2021
A
All right, well, I guess it's time for the birds of a feather for Ceph and research and scientific computing.
A
I'm Kevin. A group of us kind of just do this every other month, and that's on the Ceph community calendar if anybody ever wants to join in on that, or you can just contact me and I'll add you to my email list as well.
A
Otherwise, there's a pad in the chat if you want to add topics, anything. It's a birds of a feather, so I have no presentation. There are no set topics, just ideas; it's whatever we want to talk about with Ceph and scientific computing.
A
So if anybody's interested in sharing, if they have interesting use cases they're doing with Ceph, or fun experiments they've done to, you know, push it to its limits and make it work for what they're doing, we'd love to hear about it.
B
One thing I won't say much about, because next week our new fellow Arthur, who's on the line, will give a presentation on it, is RBD mirroring.
B
S3-wise, well, Enrico's on the line; I don't know if he wants to say something about S3, maybe we already just covered that. FS-wise, we've been doing some tests. We're adding a new CephFS region and, until now, we've never had snapshots; we've never used snapshots at any scale, and for this next cluster we want to enable snapshots from the beginning.
B
So we've been kind of stress testing it to see what the limitations are, and we learned a couple of things that maybe people already know, maybe this is obvious, but we didn't know. What we did is we just untarred Linux a bunch of times, hundreds and hundreds of times, and then we were taking snapshots and deleting things, and we realized that deleted files go into the stray directory, just like deleted hard links.
B
So there's a limit in Nautilus and Octopus of something like one million files, and after you've taken snapshots of one million files and then deleted them from the head, you can no longer remove files which are in snapshots. The good news is that in Pacific this is fixed, because the stray directories can be fragmented.
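For reference, a minimal way to watch the stray count being described, assuming you have access to the MDS admin socket (the daemon name is a placeholder):

    # num_strays counts deleted-but-still-referenced inodes parked in the
    # stray directories; in Nautilus/Octopus this is capped around 1 million
    ceph daemon mds.<name> perf dump mds_cache | grep num_strays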
B
So I think we've come to the conclusion that for this big cluster we're going to need to upgrade to Pacific in order to go production with that. And the other thing we realized: we started deleting all of these hundreds of snapshots, covering millions and millions of files, to then trim the snapshots, and we didn't see any progress for hours in the snaptrim PGs. The cluster was stable, but there was no progress.
B
So then we looked into how it's implemented, and in fact you only delete something like 15 files per second from each OSD with the default configuration, so this is actually really slow. So if we have a lot of churn, the PGs will be in snaptrim, like, all the time. We were able to change the configuration, though; I think we had something like 50 million files that needed to be trimmed, and we managed to trim them within half a day.
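As a sketch of the kind of configuration change involved (the values here are illustrative, not necessarily what was used):

    # osd_snap_trim_sleep adds a delay between trim operations; lowering it
    # (and raising per-PG concurrency) speeds up snap trimming at the cost
    # of more load on the OSDs
    ceph config set osd osd_snap_trim_sleep 0.1
    ceph config set osd osd_pg_max_concurrent_snap_trims 4
    # watch progress: PGs should move out of the snaptrim states
    ceph status | grep -i snaptrim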
B
Sorry, this is on Octopus. Arthur, this is an Octopus cluster, yeah? Yeah, this is Octopus, I'm wrong. It was 15.2.13.
C
Yeah, I mean, I think in principle what the new scheduler should do on the OSD is that it should, you know, go at full speed, but at lower priority than client work.
C
I'm not sure if they looked at this specifically, so this would probably be a good, like, micro benchmark test or whatever: queue up a whole bunch of snaptrim work and then make sure it makes good progress on an idle cluster and also, I guess, in competition with a client workload.
B
No, we're just upgrading the first clusters to Octopus now; we're still on Nautilus for most things. There was that long thread about the cephadm containers stuff; we didn't chime in on that because we don't have anything useful to comment there. So this week we started playing with cephadm so that we can have more, like, tangible feedback.
C
Yep, okay, on the time synchronization: my recollection is that all cephadm does is make sure that either the ntpd or chronyd systemd unit is turned on, but it doesn't try to configure it for you. It just makes sure.
C
All your Puppet stuff should really just make sure that Docker or Podman is installed, that time synchronization is turned on, that the lvm2 package is installed, and that Python 3 is available. That's really basically it.
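A minimal sketch of that host preparation, assuming an EL-family distribution:

    # cephadm's only real host prerequisites: a container runtime,
    # time sync enabled, LVM, and Python 3
    dnf install -y podman lvm2 python3
    systemctl enable --now chronyd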
B
Yeah, the one other thing I just remembered: we have to do everything with Stream 8 now, so we're using Stream 8, and there's an issue with Podman in Stream 8. It uses Podman 3.1 and there's some kind of bug; I don't remember, some caps, CAP_SYS-something, I forgot. There's a BZ, a Bugzilla, about this.
C
Right, so I believe that's fixed; it's certainly fixed in Pacific, and I'm pretty sure we backported that fix to Octopus too, but yeah, basically.
C
Yeah, it depends. When you use the bootstrap script, that's used for, like, starting up the initial daemons, but everything else is deployed by the manager and it will redeploy things, and so it matters more what container image you use. So if you're installing Octopus, it might have still... anyway, the bug that I'm thinking of was that if you specify privileged to Podman and also cap-add, it gives you an error, as they're redundant or mutually exclusive or whatever.
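The failure mode being described looks roughly like this (the image name is just an example):

    # podman 3.x rejects --privileged combined with an explicit --cap-add;
    # it fails with an error like the comment below
    podman run --rm --privileged --cap-add=SYS_ADMIN docker.io/library/alpine true
    # Error: invalid config provided: CapAdd and privileged are mutually exclusive options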
D
On the upgrade question: we haven't gone anywhere near Pacific yet, but we did just recently upgrade our big production cluster from Luminous to Octopus. I mean, we stopped briefly at Mimic, but it was a sort of one-day process.
D
We started the day at Luminous and ended the day at Octopus. It's about a 20 petabyte raw capacity cluster, and we got all that upgraded without, I think, any disruption to service, which was quite nice, and we're now going to probably look at Pacific on our test cluster. We did encounter almost an outage, though: the CVE fix about insecure global_id reclaim.
D
So, on our test cluster, we got the warnings about, you know, you've got clients using it and your mons allow it. We upgraded all the packages and we restarted all the Ceph daemons, and then the cluster said you don't have any clients using the old behaviour anymore, so we disabled it, and then our OpenStack started misbehaving. The thing that isn't very obvious from the documentation is that you have to restart all of your virtual machines.
D
...having done the upgrade, before you can safely turn off the insecure global_id thing. And I think there could have been a bigger, scarier flag in the documentation about that, because I thought, you know, we wait till the cluster says you aren't running clients using the old behaviour anymore, and it's safe; and it turns out that's not quite the case.
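For reference, the knob and health warnings in question; the order matters, since clients (including long-running librbd clients inside VMs) must reconnect securely before reclaim is disabled:

    # list anything still reclaiming global_ids insecurely
    ceph health detail | grep AUTH_INSECURE_GLOBAL_ID
    # only after upgrading and restarting all clients, including VMs:
    ceph config set mon auth_allow_insecure_global_id_reclaim false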
F
You have to restart those virtual machines that are using the old libraries, yeah, yeah. That's really important for OpenStack environments.
C
Okay, I'm surprised the warning went away even though you still had those clients, you'd think.
D
Yeah, it might just be that none of the machines were doing anything much at that point, because it's just a test cluster, so sometimes there's very little activity from the OpenStack clients there. So I don't know if it was that or something.
C
Oh yeah, yeah, it has to be, like, an active connection. So if the virtual machine is actually running, then the warning should have shown up. But if, like, Cinder has a Ceph credential that isn't actually, like, issuing any commands or something like that, then it won't show up.
A
Matthew, when you did your upgrade, you run, like, Debian or Ubuntu, right?
D
Yeah, and the way Canonical produce packages, you can keep the Ceph upgrade and the underlying operating system upgrade distinct. So we're going to upgrade to 20.04, probably later this year. That won't be me, because I'm leaving the Sanger, but that's another thing. But we can now upgrade from 18.04 to 20.04 and the Ceph version will stay the same. So that's quite...
F
Did you change any kind of defaults? Because I bet that with that size of cluster you have some specific parameters to tune the cluster. So, did you see anything when you were upgrading from the older one to the newer release, any timeouts or things like that?
D
I don't think we had to make any specific tuning changes between Luminous and Octopus. We use ceph-ansible, and obviously our ceph-ansible needed quite a bit of reworking; and then ceph-ansible changed what the RADOS Gateway daemons are called, which is a bit annoying, so that meant we had to be quite careful about starting and stopping and restarting the daemons with the new service names.
D
But I don't think we found particular tunables that we needed to change so far from Luminous to Octopus. I mean, some of that will be ceph-ansible setting some sensible defaults for you. Other than that, I think we kept the tunables we set previously and haven't really done that much with them, at least so far.
D
There's a bug in the version of Octopus we've got deployed on our production cluster that means turning on telemetry doesn't work, but it's because it's a slightly old version of Octopus, and Canonical have now rolled out a new version of the Octopus packages that will fix that. So that's slated for next week's at-risk period, and thereafter we are planning on turning telemetry on.
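For anyone else planning the same thing, enabling telemetry is roughly:

    ceph telemetry show                      # preview exactly what would be reported
    ceph telemetry on --license sharing-1-0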
C
Are you... you might have said this and maybe I missed it, but are you planning a transition to cephadm now that you are on Octopus?
D
I don't know yet. We use Ansible for everything, and so using ceph-ansible to manage Ceph makes quite a lot of sense, because we have ceph-ansible and then we've got some of our local roles, and we do everything like that. So I don't know; I think not immediately. It's one of those things, you know: we've got a lot of experience with ceph-ansible now and we're quite comfortable with it.
G
So that took me a little while to be brave enough to do, and afterwards it was fine and they flushed immediately, which was a bit strange, and I'm also not quite sure what we did, other than... So we've got a CephFS; it's not particularly massive, but I was doing another thing with a 40 terabyte file space with millions of files. So maybe the MDS got too busy in some way, and we've got two active MDSs.
G
So that right now is a bit strange. And the other bit that's a bit more exciting is we get to switch off our cluster over the weekend for some power upgrade, and so I figured that what we are going to do is switch off all the clients first, disable CephFS, and then follow the instructions for switching off the cluster?
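The documented procedure being referred to boils down to setting the cluster-wide flags before powering down (and unsetting them afterwards), something like:

    # stop clients first, then freeze cluster state
    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set nodown
    ceph osd set pause
    # power off OSD nodes, then MDS/MGR nodes, then MON nodes last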
F
Which Ceph version were you running?
G
Nautilus.
C
Yeah, I mean, the way that the MDS works with the log segment trimming, there's basically just, like, a reference count that has to go to zero before it can drop the log segment from memory. And so there's probably just some really subtle reference counting issue where it's not letting go of that particular segment, and just restarting the MDS clears that out, because it's sort of reading it fresh from the...
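A sketch of the workaround being described, assuming rank 0 is the stuck MDS:

    # the health warning looks like: MDS_TRIM ... Behind on trimming
    ceph health detail | grep -i trim
    # failing the rank makes a standby take over and replay the journal
    # fresh, clearing the stuck log-segment reference
    ceph mds fail 0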
C
Yeah, yeah, one of those, like, mostly harmless bugs that are really hard to track down, probably, but who knows. I wish Patrick was on, because he would probably remember whether there was anything that's been fixed there recently or whether this is something that he's...
A
I
guess
a
general
update
from
me
is
we're
considering
switching
from
using
our
web
radars
safa
sas,
we'll
probably
make
a
few
people
on
this
call.
Happy
we've
been
doing
some
performance.
A
Testing
of
you
know,
throwing
our
three
or
four
thousand
cores
at
it
and
whatnot,
and
it's
been
looking
pretty
good
to
support
processing
just
straight
on
this
ffs
volumes
without
copying
data
to
like
the
local
processing
nodes-
and
you
know
it
seems
like
a
file
store,
this
bug
report
or
whatever
out
there
to
deprecate
it
in
a
couple
versions.
C
Yep, yeah, the, like, percentage of FileStore OSDs is sort of steadily dropping, but there's still quite a lot of them. I think it's...
C
If you look at the telemetry, it's only like 20 clusters, I think, total, that have telemetry on and are still reporting FileStore. So...
C
It used to be the plan that, once we had sort of orchestrator support for the OSDs with cephadm, then we could use that to automate the transition from FileStore to BlueStore.
A
Yeah, pretty much. I think what we'll end up doing, since we're still on Nautilus, is I'd like to at least get up to Octopus, and then take the OSDs on a server or two, pull them out, rebuild them as BlueStore, and build a new CephFS on that with erasure coding. Then just start the slow transition process by moving from librados onto the CephFS and, at the same time, just start shrinking the FileStore OSDs into BlueStore ones and coordinate it that way.
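A per-OSD sketch of that shrink-and-rebuild loop (the OSD id and device are placeholders):

    ceph osd out 42
    # wait for data to migrate off and the cluster to return to HEALTH_OK
    systemctl stop ceph-osd@42
    ceph osd destroy 42 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdX --destroy
    # recreate the same OSD id as BlueStore
    ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 42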
A
I'm not sure what the CentOS Stream rebuilds are or what version they're packaging; we just use the primary repo.
F
I
have
a
question,
may
maybe
someone
here
can
enlighten
me?
There
is
a
some
bugs
or
missing
functionalities
in
nautilus
and
in
pacific
current
back
porch,
and
there
is
a
on
the
both
things
that
I'm
waiting
or
I
would
like
to
get
on
on
production.
F
There
is
a
code
for
for
it
in
the
master,
but
bringing
that
on
on
a
on
a
test
environment,
for
example.
So
back
porting
yourself,
it's
crap
getting
really
complicated,
because
the
code
base
in
rados
gateway
in
a
rudder's
gateway,
for
example,
is
is
getting
at
least
for
me.
I
I
see
that
that
is
difficult
to
backport
or
you
just
would
like
to
well.
F
We
discussed
that
earlier
and
I
think
that
walowski
did
some
patches
on
a
master,
but
they
they
are
in
a
limbo.
Currently,
I
cannot
test
them
because
I
don't
know
how
many
different
parts
I
have
to
back
port
in
order
to
get
them
on
on
a
my
test,
environment
or
how
to
compile
them
properly
without
pulling
them
from
the
full
cef
development
code.
F
C
Yeah, it looks like Casey's not on. I mean, I think the larger context here is just that there's been a lot of refactoring going into RADOS Gateway recently, and so there's a lot of churn there every release, and so backporting in general is hard.
F
If
there
is
a
refactoring
of
rados
gateway,
is
there
any
possibility
to
bring
those
brothers
gateway,
functionalities
on
all
kind
of
stable
releases
with
the
same
time?
So
I
know
the
nautilus
is
going
away,
but
still
on
octopus
and
pacific,
I
would
like
to
see
the
same
code
base
that
we
have.
We
are
running
in
master
because.
E
H
C
B
We use it only... we don't use it for... so we have everything integrated with OpenStack via magic, which I don't understand. We use the Swift interface only because the users can query their bytes used and quota through the Swift API, but they can't get that through the S3 API.
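For example, with Swift credentials exported, account usage and quota come back as plain headers (the values here are made up):

    swift stat
    #   Bytes: 123456789
    #   Meta Quota-Bytes: 1099511627776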
D
I
happen
to
know
that
wikimedia
used
the
swift
interface
for
the
rattles
gateway
for
their
media
internally,
even
though
they
don't
run
openstack.
I
don't
know
why
they
use
it
rather
than
s3,
but
that
that's
primarily
what
they're
using
set
for.
F
F
So you would... you wouldn't give your credentials on an HPC environment, like, fully open, like with S3, where you have to give your passphrase somewhere. But with Swift you authenticate yourself and then it's active for a while, and then it's gone. And that's my second challenge: I would like to get that STS working better on the Ceph RADOS Gateway.
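A hedged sketch of what that looks like against RGW's STS endpoint (the endpoint and role ARN are hypothetical):

    # exchange long-lived credentials for temporary ones; the returned
    # AccessKeyId/SecretAccessKey/SessionToken expire on their own
    aws --endpoint-url https://rgw.example.com sts assume-role \
        --role-arn "arn:aws:iam:::role/S3ReadOnly" \
        --role-session-name hpc-job-demo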
I
Pcc,
this
is
a
computing
center.
In
poland,
we
are
keeping
swift
interface
because
some
users,
actually
internal
users,
especially
the
digital
repository
people,
started
to
use
c
suite
like
five
years
ago,
and
then
they
don't
want
to
stop.
We
are
killing
this,
the
original
swift
instance
and
moving
people
to
self-based,
swift
and
actually
partially
done,
but
still
it's
hard
to
them
to
get
rid
of
swift
because
of
some
software
dependencies.
I
F
C
I
didn't
realize
that
wikimedia
had
switched
over
to
using
lyft
on
stuff.
Last
time
I
talked
to
them.
It
was
probably
like
eight
years
ago
or
something,
but
they
had
they're
actually
deploying
both
proper.
A
I
guess
regarding,
like
our
you
know:
bi
monthly
get
together,
I'm
kind
of
leaning
towards
skipping
july,
since
a
lot
of
people
on
vacation-
including
me,
probably,
and
maybe
we
just
do
august
or
september-
for
the
next
memory
chats
send
out
my
usual
emails
and
let
everybody
know.