From YouTube: 2019-10-23 :: Ceph Science User Group Meeting
B
What was our pleasure? I don't know if we need to go through it in real detail. I just wanted to quickly say that there are videos posted there now, if anyone wants to go through them. We're trying to work with Mike, or rather Mike is going to put them onto the channel on YouTube, I guess, but we have a problem on our side.
A
One thing that I was interested in in the topics there was how Tom said they were going to add a thousand OSDs. I sent him an email and they haven't done that yet; I was curious to see how that ended up for him. So maybe next time he'll be able to give us an update on that.
B
Yeah, I mean, I've spoken with them before about this. You have a question about a single large cluster versus multiple; last time I asked them about this. Basically, there's the higher-level software like XRootD that they use to expose their Ceph pool, and it gets complicated if they need to expose multiple Ceph clusters.
C
We've not seen it as a problem as yet. I don't know whether it's just that our monitors are reasonably nippy and their monitor storage is solid-state drives or something, but in terms of map updates, when we lose an OSD or one comes back, I've not noticed that being a significant impact.
B
Mixed-use-case clusters, and that's for practical reasons: with the block storage stuff, you can't always upgrade the clients. Although we haven't needed any tunables changes recently, in years past we were having to upgrade tunables with new versions of Ceph, and the running VMs were holding us back, so we separated for that reason as well.
C
And we now have an HAProxy in front of our RADOS gateways, which we're currently mostly using to stop one particular user or set of users from dominating the service. But I guess we could also use that to tune the level of S3 availability so that it didn't overwhelm block storage as well. That's not something that's been necessary for us yet, though.
B
Yeah, we have some users that do infrequent big jobs; they kind of hammer the S3 infrequently, and yeah, I don't think the block storage users would be happy during those moments.
B
Does anyone have anything to say about Nautilus? We just upgraded one cluster to Nautilus last week, so if someone's curious about that, we could say maybe a couple of things.
B
Same here, Mimic to Nautilus, and it's... it's a bit rattly.
B
Some small bugs: I think any client which is not Nautilus can't run `ceph status`, for example; you get this bug. And we found we have one issue that hits something like 10 percent of the time that we stop an OSD in Nautilus.
A
I haven't run into that OSD restarting bug yet, but I haven't been bouncing many OSDs in the clusters that I upgraded, so maybe I'll try that in my test cluster and see if I get it as well.
C
So we haven't done that, but that's what we are planning on doing, because our main clusters are all running Luminous, and Canonical are providing Luminous and Nautilus. So we're expecting to try upgrading from Luminous to Nautilus directly.
B
We also will do that similarly, from Luminous to Nautilus on one of our CephFS clusters, but we're holding back at the moment, because I think there's a known MDS bug in Nautilus 14.2.4.
B
There was a thread on ceph-users a couple of weeks ago, and it seems that the MDS in 14.2.4 has a bad backport, a bad fix, a bad patch in it that was incompletely backported from the master branch. So yeah, we're waiting until 14.2.5 or something like that before we upgrade any CephFS cluster.
B
Yeah, I'm just trying to find it now. I'm pretty sure it was 14.2.4, because I've been waiting for that fix, but I'm not finding it immediately. I'll come back to that later if I find it.
D
Okay, so our company has several clusters. We have one big cluster; it now has more than four thousand OSDs running and five RADOS gateways. This is an object storage cluster for satellite data from the European Space Agency, and this data is served through our cloud to clients. We have several smaller clusters, about 1000 OSDs, for our back-office data, and now we are deploying three new clusters for a meteo company to process weather data as well. So we are running mostly Luminous right now.
D
And we are preparing to upgrade this biggest cluster, because there are some issues right now, and we hope that these issues will be gone with the new version.
A
Gotcha. What issues are you referring to?
B
Yeah, I'm not sure which conditions trigger the bug. It says something to do with client snap caps, so maybe it's only triggered if you have snapshots enabled; maybe you don't? I don't know if that's true, but I'm waiting for 14.2.5. 14.2.3 and 14.2.4 are identical releases; there was just some ceph-volume fix, if I remember correctly.
B
Are you using replication between S3 clusters, or zones, or regions, or whatever they're called?
F
I could see, if we can get more people leveraging S3 and if, you know, sites came online in the future, that would maybe be a better way to move forward. But our sort of early goal was to try to provide both protocols at the three institutions, and so the idea was we wanted to enable CephFS and have a stretched cluster that was all one namespace for that too.
B
This is actually something different; this is just more S3 capacity in a different building. We have some bigger use cases on S3, so yeah, it's just a new cluster. But some people want disaster recovery: they want to be able to write to a bucket in our main data center, and then magic happens and it's replicated to a second data center.
B
But we don't want that for every bucket. We want that for just some buckets, and to be able to say, okay, this bucket gets replicated without the user knowing, and then other buckets we want to have only in one region.
A
Is anybody facing any big problems with Ceph right now? Bottlenecks?
B
So I have a kind of crazy idea and a crazy question; I wonder if anyone has thought about this before. On some of our servers we have lots of disks and little RAM, like 30 disks and 64 gigs of RAM, and we have already done everything possible to limit what the OSDs use with the osd_memory_target option. But I was thinking of trying this zram compressed swap, compressed RAM, on these servers to see if it helps in any way, and I've already configured the zram swap on one server. Basically, you take half of the RAM, you create zram block devices, which are compressed block devices in RAM, and then you enable the Linux swap to swap out inactive pages to that compressed RAM, and maybe, maybe by this...
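A minimal sketch of the zram setup being described, assuming root on a Linux node with the zram module available; the device name, the lz4 compressor, and the swap priority are illustrative choices, not from the discussion:

```python
#!/usr/bin/env python3
# Sketch: carve half of physical RAM into a compressed zram swap device,
# so inactive OSD pages get compressed instead of hitting disk swap.
# Assumes modprobe creates /dev/zram0 and that lz4 is available.
import subprocess

def total_ram_bytes() -> int:
    """Read MemTotal (reported in kB) from /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemTotal not found in /proc/meminfo")

subprocess.run(["modprobe", "zram"], check=True)
# The compressor must be chosen before the device is sized.
with open("/sys/block/zram0/comp_algorithm", "w") as f:
    f.write("lz4")
with open("/sys/block/zram0/disksize", "w") as f:
    f.write(str(total_ram_bytes() // 2))  # "take half of the RAM"
subprocess.run(["mkswap", "/dev/zram0"], check=True)
# High priority so the kernel prefers zram over any disk-backed swap.
subprocess.run(["swapon", "--priority", "100", "/dev/zram0"], check=True)
```

Whether the compression win outweighs the extra CPU on a box already running 30 OSDs is exactly the open question here.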
C
No, I guess we found that these things have worked for us, in the sense that we're using a fair chunk of RAM on them, but we've not had, you know, hitting-swap issues. So from that point of view, we've just continued to buy more machines at the same spec, yeah.
B
Yeah, but they'll give you a better price with half the RAM. What kind of device are you going to put the block.db on? What do you use?
B
So what kind of NVMe card, and how?
C
No, so we have 60 OSDs in the box, and each of them gets one partition, which will be on one or the other of the cards. Where we've had an NVMe device fail, we've just had to rebuild those 30 OSDs, but we've got enough redundancy in our cluster that we can just do that, and at the risk of tempting fate, we've only had one NVMe card go south so far. And we do monitor their wear.
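Spelled out, the blast radius being described looks like this; the 60 OSDs and two cards are from the discussion, the division is just made explicit:

```python
# One block.db partition per OSD, spread across the NVMe cards, so losing
# a card means rebuilding every OSD whose DB lived on it.
osds_per_box = 60
nvme_cards = 2
osds_rebuilt_per_card_failure = osds_per_box // nvme_cards
print(f"one dead NVMe card => rebuild {osds_rebuilt_per_card_failure} OSDs")
```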
C
No, I mean, I think at the moment the most worn of them has still got like 95 percent of its wear life left, and that's after a couple of years of use. So I expect we'll probably end up replacing them before they get anywhere near the end of their rated wear life.
F
I can give a data point on the NVMe wear too; we've got kind of a similar situation. We have four NVMes in a node and 60 disks, so 15 OSD DB devices per NVMe, and even before that, FileStore. They started with FileStore, you know, four years ago, and our oldest, four-year-old NVMes are down to about 60 percent of their available life. So yeah, I think the lifetime should be plenty, at least on any enterprise-class NVMe device.
F
Those are small NVMes too; the first ones we bought were 400 gigabytes apiece, you know, and over time they've become larger for the same price, just naturally. Of course, now they're too small for BlueStore, because you need some...
F
Well, that's good! We sized our newest purchase this year... we kind of went by the discussion of, you know, one of the RocksDB levels is 30 gigabytes, but leave space for compaction, so assume 60 gigabytes. So we sized for that, but we have some space even above that, just because of the sizes they come in.
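A worked version of that sizing rule; the 30-gigabyte level figure is from the discussion, and doubling it for compaction headroom is the community rule of thumb rather than an official number:

```python
# RocksDB only really benefits from a block.db partition big enough to hold
# a whole level, and compaction can briefly need a second copy of that level.
level_target_gb = 30          # "one of the levels is 30 gigabytes"
compaction_headroom = 2       # "leave space for compaction"
min_block_db_gb = level_target_gb * compaction_headroom
print(f"size block.db partitions to at least {min_block_db_gb} GB")  # -> 60
```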
D
We configure the RocksDB options to fit our partition of the NVMe device, and then we have a recurring job in our CI/CD which compacts the device offline until it fits in the NVMe partition. And you have to have enough copies to be sure that when another drive fails, you still have your data.
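A minimal sketch of what one iteration of such an offline compaction job might look like; `ceph-kvstore-tool bluestore-kv <path> compact` is the standard offline compaction command for a BlueStore OSD, but the systemd unit name and OSD path below are assumptions about their deployment:

```python
#!/usr/bin/env python3
# Sketch: offline-compact one OSD's RocksDB so it fits back inside its
# block.db partition. The OSD must be stopped while the tool runs.
import subprocess
import sys

osd_id = sys.argv[1]                           # e.g. "12"
osd_path = f"/var/lib/ceph/osd/ceph-{osd_id}"  # assumed default layout

subprocess.run(["systemctl", "stop", f"ceph-osd@{osd_id}"], check=True)
try:
    subprocess.run(
        ["ceph-kvstore-tool", "bluestore-kv", osd_path, "compact"],
        check=True,
    )
finally:
    # Bring the OSD back even if compaction failed.
    subprocess.run(["systemctl", "start", f"ceph-osd@{osd_id}"], check=True)
```

In a CI/CD job like the one described, this would run per OSD, presumably gated on cluster health so only one failure domain is down at a time.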
B
We're doing that: we have an HPC cluster running jobs, and then there are four OSDs on each of those nodes; maybe 300 nodes like that.
B
The users really hammer the CPUs on those machines, but it seems okay anyway; nobody complains about this being slow. This is CephFS, by the way, and it's fast with the kernel mount. I mean, the bigger issue is that we have two different teams in our department, one for storage and one for HPC, like the main Slurm support. So getting the operations working between our two groups is the bigger problem, because they want to reboot.
B
It'll be fine. If you have very small clusters, then there can be some kind of memory issue where, if the client and the OSD are on the same machine, there's some weird internal race condition that can happen. But if you have large clusters, then the probability that you're reading from the OSD that you're sitting on is usually quite low. So it's never happened to us.
A
I'm thinking of trying a little bit of that, like putting a condor startd on some of my Ceph nodes, because there's a bunch of CPU that's idle a lot of the time.
A
Okay, I suppose we can wrap it up. As for the next one, it'll probably be in January; we're doing them every other month, and every other month ends up on Christmas Day.