From YouTube: Ceph Science Working Group 2020-11-25
A
All right, I suppose we can get started. I dropped the link to the pad in the chat here; if you want to sign in, go ahead, or take any notes or anything in there, feel free.
A
Otherwise, welcome to our member call. If you haven't joined us before, it's just a bunch of people who work in research computing and whatnot, and we talk for half an hour to an hour about the problems that we see at scale, or anything of that sort.
A
If you have any topics, drop those in the pad, or feel free to just bring them up in the chat here. This is pretty free form; I just kind of try to keep the conversation moving between topics that are in the pad, yeah.
A
If not, I'll just start running down some of our usual topics here and some stuff that's been added. So, does anybody have any recent outages, bugs, or anything like that that they want to share and talk about?
A
I guess, since nobody's speaking up, we've all had a great past two months with no outages or anything; that's a very welcome change. Though, kind of along the same topic, has anybody hit any bugs that have caused them headaches? Not necessarily an outage, but just a general pain to have in the cluster for clients.
B
This is with Nautilus, like 14.2.11. I think we had to re-enable bluefs_buffered_io; that helped a lot. And then Igor, the BlueStore genius, is preparing a patch for PG deletion, and I guess that would fix the underlying problem, but we need buffered IO in the meantime; otherwise the cluster is not stable.
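
For reference, a minimal sketch of re-enabling buffered BlueFS reads as described above; bluefs_buffered_io is a real OSD option, but the exact workflow here (and whether a restart is needed on a given release) is an assumption rather than a recipe.

```
# Re-enable buffered reads for BlueFS (reported above to help with slow PG deletion on Nautilus)
ceph config set osd bluefs_buffered_io true

# Confirm the value a running OSD actually sees
ceph config show osd.0 bluefs_buffered_io
```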
A
It's interesting; it sounds kind of similar to what I've seen on FileStore, where after doing rebalancing I'd see some OSDs running at high utilization because there's still an old PG hanging around in the directory, and I never got to...
B
The end result is a lot like how FileStore was when we used to run that: the deleting and merging, what was it, the merging that happens inside the FileStore directories was too slow. This is a completely different cause, but the effect is quite similar; the OSDs start flapping, and then we had to pause the rebalancing, like set norebalance, and then find a way to make it more stable. Interesting.
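
A minimal sketch of the "pause the rebalancing" step mentioned here, using the standard cluster flags; which flags to combine is situational and not something the call prescribes.

```
# Pause data movement while stabilizing flapping OSDs
ceph osd set norebalance
ceph osd set nobackfill      # optional, also pauses backfill

# ...investigate, tune, restart OSDs as needed...

ceph osd unset nobackfill
ceph osd unset norebalance
```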
A
Here, along with bugs, I just kind of looked around at some of the recent releases and saw that 14.2.12 fixed that OSD map issue in 14.2.11.
A
Anybody have any comments on new Octopus installations, upgrade procedures from an old version to it, or experiences of that sort from the last couple of months?
D
Regarding the bugs, we have one with an OSD memory leak in Nautilus. After upgrading, we noticed that our OSDs are taking more and more memory, and we have OSDs continuously flapping in our cluster, about one flap per hour across 5000 OSDs. I talked with Mark Nelson from the performance team and he agreed about that.
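
As a rough illustration of chasing that kind of OSD memory growth (not necessarily what this site did), the per-daemon mempool dump and the memory autotuning target look like this; the 4 GiB value is only an example.

```
# Per-OSD memory breakdown by mempool (osd_pglog, buffer_anon, bluestore caches, ...)
# run on the host that owns osd.0, via the admin socket
ceph daemon osd.0 dump_mempools

# Cap the OSD memory autotuner target (value in bytes; 4 GiB here is just an example)
ceph config set osd osd_memory_target 4294967296
```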
B
They became so large that all the OSDs on a machine ran out of memory, and then we restarted and it replayed the PG logs, and all of that memory went from being in the pg_log mempool to being in the buffer_anon mempool. Then we shortened the PG log from 3000 max entries to 500 max entries, and we haven't seen this since then, but I have no idea why the PG log got so large for us. That particular one, it was...
B
It was actually immediately after we upgraded to Nautilus; we went straight to 14.2.11. The memory usage was going up and up and up, and the PG logs seemed like they were never freeing their memory or something like that. Then we rebooted everything once, and we haven't had it happen again since.
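
A sketch of the PG-log shortening described above, using the stock option names; the 500 value mirrors what was mentioned on the call, and whether to also lower the minimum is a judgment call.

```
# Shorten the per-PG log (the call mentions going from 3000 max entries to 500)
ceph config set osd osd_max_pg_log_entries 500
ceph config set osd osd_min_pg_log_entries 500

# Watch the pg_log mempool on one OSD to confirm the effect (run on the OSD host)
ceph daemon osd.0 dump_mempools | grep -A 3 osd_pglog
```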
D
So, after some testing, we decided to roll back to civetweb, and civetweb is working.
D
Okay, it's working, with a little lower performance, but it's working. I will post the bug tracker later, where we have the investigation of why Beast isn't working; there are some thoughts that it's about erasure-coded pools and big buckets, and that Beast can't handle requests to those.
D
But the interesting thing is, and I will send the tracker later, that there isn't one single point of exception while the RADOS Gateway is working; there were three or four types, so it isn't easy to catch exactly why.
B
Interesting. We had one issue recently where, so, we normally do the TLS termination in our front ends, which is Traefik, this load balancer. Anyway, we have some gateways that we connect to directly with SSL, and we had an issue just this week connecting to those: they were giving TLS errors. I wonder if this is the same thing.
D
I can write you up how our tests went when we tested it. I suppose I'll test it on my side on Friday, because tomorrow I have other things to do. But it's very interesting, because when we did one upgrade where we have a replicated pool, everything went smoothly, but on our largest cluster we had problems.
A
Keep an eye out for that Ceph tracker; it's interesting.
C
Just going to say, for what it's worth, to do with upgrades: we've just upgraded two clusters, two small ones, from 15.2.5 to 15.2.6, with absolutely no problems whatsoever.
C
The only Octopus upgrade we ever had trouble with was the one with the OSD corruption, but that was one specific thing. Although, we're not using cephadm; we don't trust it yet.
A
One feature I saw in 15.2.5 was the new warning about when an OSD gets repaired too many times, which should be nice for trying to track down those iffy OSDs.
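
For reference, the warning being referred to surfaces as OSD_TOO_MANY_REPAIRS in cluster health and is driven by a monitor option; this is a sketch of looking at and tuning it, with the default threshold of 10 assumed.

```
# Threshold of auto-repaired reads before an OSD raises OSD_TOO_MANY_REPAIRS
ceph config get mon mon_osd_warn_num_repaired
ceph config set mon mon_osd_warn_num_repaired 10

# The warning shows up in health output once an OSD crosses the threshold
ceph health detail
```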
A
Somebody added something about multi-MDS and wanting to give up on it.
B
Oh, that was me. Well, it's just that we upgraded our biggest CephFS from Luminous to Nautilus, and we used to have 10 active MDSs, and upgrading was actually quite painful because you have to shrink down to one MDS. There was one step, I think from three to two, where we actually had 30 minutes of slowness, because re-exporting everything that was pinned to mds.2 or mds.3 to move it down to two was not transparent.
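
A minimal sketch of the shrink-to-one-MDS step described above, with "cephfs" standing in for the actual filesystem name.

```
# Reduce to a single active MDS before the upgrade
ceph fs set cephfs max_mds 1
ceph status          # wait until only rank 0 remains active

# After upgrading, grow back to multiple active MDSs
ceph fs set cephfs max_mds 3
```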
B
So anyway, we probably shot ourselves in the foot with too much subdirectory pinning.
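
For context, subdirectory pinning is done with the ceph.dir.pin extended attribute; the path below is hypothetical.

```
# Pin a subtree to MDS rank 2 (a value of -1 removes the pin)
setfattr -n ceph.dir.pin -v 2 /cephfs/some/subdir
getfattr -n ceph.dir.pin /cephfs/some/subdir
```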
B
But anyway, we went down to one, upgraded, and now we have three active, and actually the first one is doing most of the work while the second and third are doing very little. So I'm thinking that just one active MDS might even handle the load anyway, and all this multi-MDS isn't worth the trouble. I just wonder if people out there have really metadata-intensive workloads on just one single MDS, or if people really do think that multi-MDS is needed.
B
And the other reason I was looking at this recently, at going down to one, is because I wanted to scrub a path to fix some metadata that had gotten weird; some files had disappeared or something. I tried to scrub the path, and I got the warning that scrub is not supported with multiple active MDSs, and then I saw that this has now just been merged to master, and I guess it will appear in...
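
A sketch of the path scrub being described, using the ceph tell form from recent releases; the filesystem name and the path are assumptions.

```
# Forward scrub a path on rank 0, repairing what it can
ceph tell mds.cephfs:0 scrub start /some/path recursive,repair
ceph tell mds.cephfs:0 scrub status
```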
B
No, I mean, we're just starting to get our first physics users on it. Most of the time it is used for kind of infrastructure stuff, like all of our Linux repositories, and it actually has the home directories for HPC, but those guys don't really hammer it too hard; they go pretty easy on it.
D
Currently, we are working on the spillover problem, because it affects us. We proposed a new volume selector for BlueFS; I will find the PR.
B
So is this something on top of the use_some_extra policy that just got released? No?
D
So we spent some of the time creating that proposal.
D
It's actually still running in test environments; we are slowly preparing to run it in pre-production, but I think once it gets merged into master we will decide to run it in production.
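
As background on the spillover discussion, this is roughly where the existing knobs live; the proposal mentioned on the call is a separate selector and is not shown here, and use_some_extra is the stock policy being asked about.

```
# BlueFS spillover to the slow device shows up as a BLUEFS_SPILLOVER health warning
ceph health detail | grep -i spillover

# The existing BlueFS volume-selection policy knob
ceph config get osd bluestore_volume_selection_policy
ceph config set osd bluestore_volume_selection_policy use_some_extra
```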
D
The reorganization of data during the upgrade went quite smoothly; we didn't notice any interruptions. Maybe on another cluster there will be something, but on our test cluster we didn't see anything. We will see with pre-production and then production later, but we're waiting for the merge before we do it on our production.
B
Can I ask a different question about BlueStore? There are several allocator-related crashes that people are reporting, especially with NVMes. Do you see any allocator-related crashes at your site? Do you have some kind of custom BlueStore configuration that you run, or do you just use the default 14.2.11 BlueStore configuration?
B
One last question, maybe I can ask since everyone's quiet: we run our RADOS gateways on some VMs with only 32 gigabytes of RAM, and sometimes, if the cluster gets very slow or if a user sends a lot of requests, if users really hammer the RADOS gateways, the memory usage of the RADOS gateway can increase a lot, and we're trying to find ways to limit that.
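
One generic way to put a hard ceiling on radosgw memory on such a VM is a systemd cgroup limit; this is only an illustration of "a way to limit it", not something recommended on the call, and both the instance name and the 24G figure are made up for the example.

```
# Cap the radosgw service with a systemd override (unit/instance name varies by deployment)
sudo systemctl edit ceph-radosgw@rgw.gateway1.service
#   add in the override:
#   [Service]
#   MemoryMax=24G
sudo systemctl restart ceph-radosgw@rgw.gateway1.service
```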
D
Eight, okay. But we did some tests on Luminous a long time ago, where we connected 14 or 15 RADOS gateways to the cluster to check if we could download data from it at 100 gigabits, and it scaled well.
B
No, we have, say, eight or nine, maybe 10, I don't know; we have 15 total. We use round-robin DNS with, say, 10 of them, and then we have Traefik, which is like HAProxy. I don't know if you know Traefik; it's like HAProxy listening on ports 80 and 443, and then inside we route: we look at the bucket name and route to specific RADOS gateways depending on the bucket name, depending on a regex of the bucket name.
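
An illustrative HAProxy-style sketch of routing by a regex on the bucket name; their setup uses Traefik, so this is only an analogy, and every name, address and regex below is hypothetical.

```
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
frontend s3_front
    bind :80
    bind :443 ssl crt /etc/haproxy/certs/s3.pem
    # path-style S3: the first path component is the bucket name
    acl bucket_atlas path_reg ^/atlas[^/]*(/|$)
    use_backend rgw_atlas if bucket_atlas
    default_backend rgw_general

backend rgw_atlas
    server rgw1 10.0.0.11:8080 check
backend rgw_general
    server rgw2 10.0.0.12:8080 check
EOF
systemctl reload haproxy
```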
D
Improvements in writes to the cluster, but reads are at the same level.
A
Yeah, and well, I finally got some new switch infrastructure that's more 40 and 100 gigabit switches, because right now it's all just 10 gigabit for my networking in that part of the data center, and I always thought that would be kind of a pain with RADOS Gateway, because I would just need an insane number of ports going to each host, because that compute cluster will pull, you know, 160 gigabits or whatever.
A
That's why we stuck with librados, because you didn't have the bottleneck of the gateway in there, right? Yeah. But now that we have the infrastructure to support it, I might actually start going down that path of finally getting over onto using RADOS Gateway and off of librados, so that I can finally do that BlueStore switch from FileStore, because of the file size limit that BlueStore has, and finally realize some of the benefits of it.
B
Yeah, I think our friends at Pawsey Supercomputing in Australia are trying to do multiple gigabytes, multiple tens of gigabytes per second, through RADOS Gateway as well, on new infrastructure that they're, I think, procuring now. I'm just checking who's on the call; yeah, they're not here.
B
There was also, back at the Beijing Ceph Day, I think, one cloud provider running a mini RADOS gateway on every client machine.
B
And connecting locally. But I think they got advice that this isn't a good idea, because the RADOS gateways communicate with each other to deal with the bucket cache. Oh.
B
The main place... so, we also have NetApp and Ceph, and we use NetApp for our Oracle databases, so our DB team has not yet had the guts to move to Ceph for that, versus CephFS. But to be honest, just keeping it technical, I'm not sure how we would back up a CephFS cluster that was running all of our Oracle, because we have something like 10 petabytes of Oracle, or more now.
A
Anybody else have any fun topics, or, I don't know, any fun HPC topics in general?
C
The throughput on them, you know, was about 99.2 gig when we tried to stress them. That's fast enough.
A
All right, well, thanks everybody for joining in; good talk. The next one, I guess, will be close to the end of January, whatever the fourth Wednesday of January is. I'll send out the usual reminders on the Ceph list and to the private email group as well. Otherwise, enjoy the holidays, and, I guess, yeah, continue to be safe.