From YouTube: Ceph Performance Meeting 2020-09-10

A: Hey, sorry about that, I had Firefox issues for some reason. It's not happy at the moment. So anyway.

A: Josh, it looks like you were starting. What's up?

A: Oh okay, I'll be frank: I don't have a whole lot here. I only got about halfway through reviewing pull requests, so not a whole lot. Josh, it looks like you merged your auth monitor PR. Cool, that's good! There are a couple of other updated ones. It sounds like maybe this osd_async_recovery_min_cost one; the thought now is that we might leave that at the default.

D: Last week we found out that some of the tests that were being done didn't have the right CRUSH rules, because of which the results may not be reliable. So we need to wait on that one until further results are ready.

A: Cool. Let's see, and then I think Kefu is still looking at the PR for optimizing BlueStore.

A: Okay, but Kefu reopened it. Okay, so yeah, it looks like there hasn't been any real movement on this since last week; it was just closed and reopened. All right, yeah. Otherwise I'm not seeing a whole lot. Were there any PRs this week that anyone had that should be on this list?

A: All right, well then, let's move on. Josh, I hear you were asking Igor: did he have a status update for us?

E: Yeah, sounds good. Well, first of all, the introduction: we've got multiple complaints about pretty slow pool removal in Ceph. It might take several hours to reclaim the space after pool removal, or sometimes, I've heard, even days.

E: So this involves collection listing, which is retrieving up to 30 entries per placement group (30 onodes, well, their names). After that it attempts to remove each entry from the snap mapper, which I'd prefer to leave aside for now; it's not a trivial operation as well, but for my experiments I didn't have any snapshots, so it was a trivial one. Then, in the third step, the OSD invokes onode removal from BlueStore, which in fact is a bunch of other sub-operations: removing all the onode's omap, then reclaiming all the space on the main device occupied by this onode, which means updating the BlueStore allocator, and then removing all the metadata records for this specific onode. Once this iteration over 30 entries is completed, the task sleeps for some time, determined by the OSD parameter osd_delete_sleep, which differs depending on the setup.
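
In rough pseudocode, the loop being described looks something like the sketch below; every name in it is an invented stand-in rather than an actual Ceph symbol:

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Illustrative stand-ins; none of these are actual Ceph symbols.
struct ObjectName { std::string name; };
bool collection_list(unsigned max, std::vector<ObjectName>* out); // step 1: list up to `max` onode names in the PG
void snap_mapper_remove(const ObjectName&);        // step 2: drop the snap mapper entry
void bluestore_remove_omap(const ObjectName&);     // step 3a: drop the onode's omap
void bluestore_release_extents(const ObjectName&); // step 3b: update the allocator, reclaim space on the main device
void bluestore_remove_metadata(const ObjectName&); // step 3c: drop the onode's metadata records

void remove_pg_objects(double osd_delete_sleep_sec) {
  const unsigned batch = 30;  // entries fetched per listing pass
  std::vector<ObjectName> names;
  while (collection_list(batch, &names)) {
    for (const auto& o : names) {
      snap_mapper_remove(o);
      bluestore_remove_omap(o);
      bluestore_release_extents(o);
      bluestore_remove_metadata(o);
    }
    // throttle between 30-entry batches, per osd_delete_sleep
    std::this_thread::sleep_for(
        std::chrono::duration<double>(osd_delete_sleep_sec));
  }
}
```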

E: Okay, so well, actually there are a bunch of issues with this removal procedure.

E: Well then, the third issue...

E: But in fact, since we are talking primarily about database operations, these two setups should already be the same, in my opinion. So a hybrid setup, from the database point of view, is no different from an all-flash one, and potentially we might have some hidden issues on a full-flash setup, where pool removal might cause a significant performance drop, since the default parameter is set to zero seconds.

E: And the fifth issue is about using more advanced techniques to remove multiple records from RocksDB. Actually, it has a DeleteRange function, which might...

E: Well, which is not ideal, to be honest. On one side it allows removing multiple sorted records in a single shot; from another...

E: But maybe we should reconsider this and try to enable it for all this pool removal stuff. I'll show how we can update this removal procedure to benefit from the DeleteRange function on one side, while on the other side we wouldn't produce that many tombstones, which this function creates in the database, since actually we'll need just three tombstones per placement group to be removed.
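
For reference, RocksDB's DeleteRange writes a single range tombstone covering a half-open key range instead of one point tombstone per key. A minimal sketch, with illustrative key bounds rather than Ceph's real key layout:

```cpp
#include <cassert>
#include <string>
#include <rocksdb/db.h>

// Remove every key in [prefix, prefix + "\xff") with a single range
// tombstone instead of one point tombstone per key. The bounds are
// illustrative only, not Ceph's real key layout.
void drop_range(rocksdb::DB* db, const std::string& prefix) {
  std::string end = prefix + "\xff";
  rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(),
                                      db->DefaultColumnFamily(),
                                      prefix, end);
  assert(s.ok());
}
```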

E: And the last thing about this DeleteRange operation, which I discovered, not mentioned here, is that at some point we introduced...

E: Before using this function, we introduced a record counting procedure to estimate whether we want to apply this function or not, which actually iterates over a bunch of records until the threshold is reached; if we have enough records, then we proceed with DeleteRange, otherwise we delete using the regular function. And it looks like this procedure might introduce additional overhead as well, which is not that small, as I've seen, and again we might get rid of this stuff.

E: I mean record counting, for the new pool removal stuff, and maybe after that DeleteRange wouldn't be this...

E: Well, in the next slide I'm going to present the ideas of how we can redesign this removal procedure. So, any questions so far before proceeding to the next slide?

E: So here is the new design which I'm proposing. I have a working PoC for that, and I'll show some numbers a bit later if we have enough time. But again, let me present this procedure, for now leaving aside the snap mapper stuff. What's the main issue with slow pool removal from the user perspective?

E: And once we reclaim all the space, we can proceed with removing all the records from the database. Here there are two options.

E: Actually, the first one is to leave things as is, that is, iterate over every onode again and remove them as we currently do: removing their omaps, again listing over them, and then removing all the onode metadata. But another option might be to benefit from the DeleteRange function.

E: And in fact we can issue just DeleteRange operations on the placement group. The first one is to delete all the regular onodes for each PG, which are sorted, and hence we can apply this DeleteRange. The second one is to delete temporary onodes, which have a bit different naming and hence form a different range. The third one is to remove all the omaps belonging to objects in this placement group. Potentially this is possible, but our current omap naming format doesn't provide this ability; actually, we need to change the naming scheme to include the placement group ID into the naming, and once we do that we might...
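
A sketch of that final step, assuming hypothetical key-prefix helpers (onode_prefix, temp_onode_prefix and omap_prefix are invented here, not the actual BlueStore key layout) and, for the omap case, the PG-aware naming change just described:

```cpp
#include <string>
#include <rocksdb/db.h>

// Hypothetical key-prefix helpers, not the actual BlueStore key layout.
std::string onode_prefix(const std::string& pgid);      // regular onodes of the PG
std::string temp_onode_prefix(const std::string& pgid); // temporary onodes (separate range)
std::string omap_prefix(const std::string& pgid);       // omaps; only valid with PG-aware naming

// After space reclamation, drop everything the PG left in RocksDB with
// just three range tombstones instead of per-object deletes.
void finish_pg_removal(rocksdb::DB* db, const std::string& pgid) {
  rocksdb::WriteOptions wo;
  auto del = [&](const std::string& prefix) {
    std::string end = prefix + "\xff";  // illustrative upper bound
    db->DeleteRange(wo, db->DefaultColumnFamily(), prefix, end);
  };
  del(onode_prefix(pgid));       // 1: regular onodes, sorted per PG
  del(temp_onode_prefix(pgid));  // 2: temporary onodes
  del(omap_prefix(pgid));        // 3: omaps (requires the naming change)
}
```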

B: About reorganizing the omaps, all right, all right. Didn't we change them to be organized on a per-pool basis?

E: Maybe, but it requires some additional modifications in the code, since currently it operates on placement groups. That's the first issue. And the second issue: potentially we might benefit from this new removal procedure when we need to remove just a single PG, when it's rebalanced to a different host or something, and if we use pool IDs, this wouldn't benefit from it.

E: Okay, so well, the next thing is the benchmark. So far I didn't have enough time to arrange it properly, sorry about that, but here's what I have so far: some numbers benchmarking the original delete versus some tuning of this delete sleep.
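
For context, the throttle being tuned in these runs is the osd_delete_sleep option family; assuming the current option names, the settings could be adjusted along these lines:

```
# Throttle between removal batches, in seconds; 0 disables the sleep.
ceph config set osd osd_delete_sleep 0
# Device-class variants exist as well, e.g. for HDD-backed OSDs:
ceph config set osd osd_delete_sleep_hdd 2
```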

E: And here we can... well, originally the initial performance, the bandwidth, was around 7.5 megabytes per second, and on the second run, with pool removal running in parallel, we are getting around 6.8.

E: And the removal time is even shorter. But in fact one hidden issue with these numbers is that they show average bandwidth.

E: As I'll try to show a bit later, actual bandwidth varies depending on whether removal is completed or not. So on average we are getting pretty good numbers for delete sleep 0, but in fact we have pretty low bandwidth while removal is running, which makes it a bit questionable whether we want to use it.

E: An incomplete, partial fix introduces space reclaiming before the removal, but the removal itself is still using the same seek-then-delete procedure. And here you can see, in this reclaim time row, that we are able to complete space reclamation much faster compared to the original pool removal, which is good for users, since they get their space back much faster.

E: Then the next column, which is not very interesting: it's a fully updated removal procedure, which performs space reclamation and then proceeds with the final delete ranges on this placement group. But since it has a two-second sleep it's still slow, because reclamation is still operating on these 30-entry portions and then sleeping for two seconds. So it's here just for reference.

E: More interesting is the next column, which is H. You can see it's some degree faster than our original removal for the same sleep period. We can also see that space reclamation takes 166 seconds and the total pool removal takes 169 seconds, so we need just three seconds to perform all the completion using this DeleteRange.

E: And the next column is the same, with the sleep period set to zero.

E: So, well, it looks like from the read performance perspective we are not affected much by this DeleteRange. One questionable thing is row 8, which is fsck running after the pool removal, and here I can see that it takes longer with all these changes.

E: Well, what else I have is some diagrams of how this second write run goes over time, in parallel with removal. Here the blue line refers to the original delete with the two-second period; you can see it's pretty stable on one side, but it's a bit lower all the time. And again, we don't have the space totally freed over this period, so actually the completion of pool removal is somewhere outside this diagram.

E: And the green line here is again a regular delete, coupled with the pure space reclamation.

E: And here is another set of diagrams, which again includes the original delete (blue lines) and then two new runs with all this new pool removal stuff, where the red line is with the sleep period for reclaim set to zero.

B: Thank you, yeah, that's really interesting. I think a lot of the ideas you talked about in the slides made sense in terms of the way to restructure things.

E: Yeah, yeah, yeah. Indeed, I observed some...

E: Actually, I haven't managed to remove it completely; I managed to reduce this drop, but it's still visible.

B: Yeah, I mean the restructuring of the deletes makes sense in general to me. I think the part that may be more questionable is the remove range and the impact it's having.

E: We know all these bulk removals, but, well, potentially it looks much better: just run one operation and it's complete. But unfortunately RocksDB doesn't perform it ideally, some...

B: Yeah, I agree that those changes definitely make it more likely to work, but I guess we need to do more testing and see.

A: The big thing was the number of... my brain is blanking... basically deleted keys, tombstones, there we go: the number of tombstones that really, as they accumulated, caused excessive overhead.

D: So Mark, remind me: the part about record counting only got added later in the rm range picture, right? The initial implementation did not have any record counting.

D: Yeah, so we just added it because we knew that we had some performance results that showed that without record counting, rm range was not doing so well. So we wanted to introduce record counting to introduce a threshold, right?

A: Yeah, so if you're deleting a huge number of keys, the thought was that at that point maybe it's better. But I don't know that we really looked at what that value really is; we just set it arbitrarily high and then said, well, for now let's avoid the really bad behavior, and then kind of punted on figuring out what it should really be set to. So right now I think we effectively just don't use it, because the value's so high.

D: Yeah, I remember the part where we just increased the threshold to a very high number so that it's not enabled by default. But I'm just trying to think: what Igor is proposing is that if we remove the record counting altogether, just for the pool deletion case, would we see the same performance impact that you had seen when you ran the tests on just the base implementation of rm range, without the record counting?

A: But I think that if we just leave it to its own devices, hoping that we don't end up with a lot of tombstones left lying around, we may end up in the situation we were in before, possibly.

D: Yeah, but it does sound like even if you leave the rm range part out, the other proposals that he made in one of the slides can be made incrementally, right? Like the delete sleep thing that he proposed, and also the collection listing. So I mean, I guess there's a lot of things that we can start working on immediately.

A: Definitely, yeah. The rm range thing is just, you know, one small piece, and there are almost certainly ways that we can work around it. I mean, it might even be good enough just to trigger compaction as soon as it's done.
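
A minimal sketch of that idea using RocksDB's manual compaction API; compacting the whole keyspace here for simplicity, though a real change would presumably restrict it to the removed range:

```cpp
#include <cassert>
#include <rocksdb/db.h>

// Trigger a manual compaction once pool removal completes, so accumulated
// tombstones are physically dropped instead of waiting for background
// compaction. nullptr bounds compact the whole keyspace.
void compact_after_removal(rocksdb::DB* db) {
  rocksdb::Status s =
      db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  assert(s.ok());
}
```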

E: Yeah, makes sense, but honestly I have never tried this manual compaction before. Well, I don't understand how it behaves from a performance point of view, how applicable it is in parallel with the regular load on this store, but yeah, I will try.

A: Yeah, I don't know that any of us do. This problem isn't just one that we face: if you go out and look at the RocksDB archives, like the mailing list or whatever (well, you can't really look at the Facebook one, but there's a Google Group thing that you can look at), there are other people that have hit issues like this before. It's not super...

G: I don't really have a question, rather a request. Igor, is it possible to share the slides and the results?

A: Thanks. So we pretty much used up the whole time, guys, which is good; it's always good to have things to discuss like this. Should we schedule next week to continue the onode discussion and move the paper off another week, or do people want to talk about the paper?

G: I read the paper, and personally I don't feel like it's that urgent. I mean, I think we can continue the discussion of the onode next week and postpone the discussion of the paper, but...

A: I'm really curious to see what you uncover as you continue to work on this. All right, well then, see you guys next week. Have a great week, everyone.