From YouTube: Ceph Crimson/SeaStore OSD Weekly 2020-09-23
A: So, let's start. Last week, as always, I've been working on the structure test fix, and what I showed was that when the interval changes while we still have a pending request waiting for an object context, the request being executed just bails out and keeps retrying, but the pending ones managed to be scheduled before the interrupted one. So we have an out-of-order response; I'm trying to reproduce it.
B: Wait, why would that happen?
A: Because I think the PG interval changed, so... yeah.
C: I see. So I guess what you need is for the interruptible thing, when that request wakes up and observes that the interval has changed, to drop the lock and re-queue itself, right?
A: This whole thing is, you know, written in a repeat loop.
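As a rough sketch of that repeat loop (a Python model for illustration, not Crimson's actual Seastar/C++ code; `PG`, `IntervalChanged`, and the lock methods are all invented names): a request that sees an interval change drops its lock and goes around the loop again instead of completing out of order.

```python
class IntervalChanged(Exception):
    """Raised when the PG interval changes under a waiting request (assumed signal)."""

class PG:
    """Toy stand-in for a placement group: tracks which objects are locked."""
    def __init__(self):
        self.epoch = 1
        self.locked = set()

    def lock_object(self, oid):
        self.locked.add(oid)

    def unlock_object(self, oid):
        self.locked.discard(oid)

def run_request(pg, oid, do_io):
    """Run do_io() under the object lock; on interval change, drop the
    lock and retry from the top (the 'repeat loop' from the discussion)."""
    while True:
        pg.lock_object(oid)
        try:
            return do_io()           # may raise IntervalChanged mid-flight
        except IntervalChanged:
            continue                 # re-queue: go around the loop again
        finally:
            pg.unlock_object(oid)    # the lock is dropped on every path
```

The `finally` clause is the important part: whether the IO completes or is interrupted, the lock is released before the request either returns or re-enters the loop.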
C: Okay, so let's think about the sequence of operations that happens to an IO. The first thing you do is put yourself in a queue. Then you come off the queue: it's your turn to run. So you look at the message, you figure out what the object ID is, and you read that object context off disk, which is state in memory that you're holding a pointer to, a shared pointer or whatever.
C: Then you try to take the rw-state lock on the object state, or whatever this new tri_mutex thing is, right? So for a read, at least, this sequence is pipelined. We might have multiple reads occurring on the same object, and there could be, like, ten reads on the very same object blocked waiting for a write to complete.
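A toy model of that per-object rw state (a deliberate simplification, not the real tri_mutex; all names are invented): many reads may hold the lock together, but reads arriving while a write is in progress park themselves in an implicit queue behind it.

```python
from collections import deque

class RWState:
    """Simplified per-object read/write state with an implicit wait queue."""
    def __init__(self):
        self.readers = 0
        self.writer = False
        self.waiting = deque()   # requests blocked behind the current holder

    def try_read_lock(self, req):
        if self.writer:
            self.waiting.append(req)   # pipelined read blocks behind a write
            return False
        self.readers += 1
        return True

    def read_unlock(self):
        self.readers -= 1

    def try_write_lock(self, req):
        if self.writer or self.readers:
            self.waiting.append(req)
            return False
        self.writer = True
        return True

    def write_unlock(self):
        """Release the write lock; blocked requests get to retry."""
        self.writer = False
        woken = list(self.waiting)
        self.waiting.clear()
        return woken
```

This is exactly the "ten reads blocked on one write" situation: they all sit in `waiting` until `write_unlock` wakes them, which is also why cancelling only the in-flight write is not enough.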
C: So let's say that then the PG epoch changes. It's not enough for the write that's currently in progress to be canceled; all of those reads have to be canceled too.
A: Hold on, we don't have the write request in flight when we have read requests.
A: Right, because they already have the active PG, and... yeah.
C: An error signal, whatever: it's going to use that signal to drop its lock and re-queue itself. Then the next read is going to get its lock, observe that the PG epoch has changed because it got the error signal, drop its lock, and re-queue itself, and so on up the chain. Oh, so they're all going to re-queue; it's not just the write.
C: No, the p... no, no, no! The OSD goes back in the... so right now, in classic OSD, there's an actual queue, right? What we do is re-queue the op into the central OSD processing pipeline: the message itself, the original message. We drop all of the in-memory state for that request. For Crimson it's a little more complicated, because we don't have a queue like that; we have an implicit queue in the form of the pipelines.
C: So what you should probably do is drop absolutely all of the state associated with the pipeline and go back to the beginning: the very first thing where you pick up the PG lock and start from there, or where you pick up the PG and start from there. Remember, after a peering state change it might not even be the same PG.
A: But in the case of an acting set change, we can assume that it's just the acting set that changed. It does not imply that the primary changed.
C: If the primary changed, though, you would have to go through literally every check the primary does and make sure that it's invariant over an interval change, and they're not. After an interval change, recently written objects will typically be degraded; that's normal, because the replicas may not have seen the most recent IO, so the primary is going to have to re-recover the current state of that object back to them. Which means writes have to block, which means those reads needed to block earlier in the chain: they needed to block on wait-for-degraded-object or whatever, I'm telling you.
C
Too
many
different
ways:
this
can
be
difficult
one
day
in
the
future.
We
might
choose
to
be
smarter
about
this,
but
yeah.
It's
not
easy
to
do
this
correctly.
It
would
be
simpler
and
more
effective
and,
in
the
common
case,
exactly
as
performance
to
simply
re-queue,
in
that
case,
mostly
you're,
going
to
have
to
perform
the
same
checks
in
the
same
order
anyway.
So
this
even.
C
The
primary
is
clearly
the
same,
but
think
of
all
the
things
that
aren't
necessarily
the
the
same.
The
object
version
could
have
changed.
The
current
object
context
on
disk
could
have
changed.
It
doesn't
seem
like
that's
true,
but
it
actually
is
because
after
appearing
after
appearing
the
pg,
authoritative
pg
log
may
have
changed.
C: So let's say that when a PG interval change happens, there are 100 IOs outstanding on a PG. The primary has persisted them; none of the replicas has seen them yet. So when the OSD map goes out, changing the acting set but not the primary, the primary will be the only OSD with those 100 log entries.
C: So during peering we're going to compare those log entries to our peers'. With a replicated PG, we're going to decide that all of those objects are degraded, because the primary has a copy that no one else has, so before we can do anything to them they have to be re-replicated over. And you'll notice the wait-for-degraded check actually does happen on a read.
C: It can situationally happen on a read request, depending on whether it's RW and whether it's write-ordered. There is a lot of detail in the librados flags that changes the way read and write ops are ordered. I really, really, really do not think it is a good idea to recheck all of these conditions again just to avoid what is, frankly, not an expensive operation, which only happens during peering anyway.
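The peering decision described above can be sketched like this (the data shapes are invented for illustration and are not Ceph's actual log structures): an object is degraded when the primary's log records a version that some peer in the acting set has not seen, and IO to that object must wait until it has been re-replicated.

```python
def degraded_objects(primary_log, peer_logs):
    """primary_log / each peer log: mapping of oid -> latest version seen.
    Returns the oids the primary must re-replicate before serving IO."""
    degraded = set()
    for oid, version in primary_log.items():
        # Degraded if any peer is missing this version of the object.
        if any(peer.get(oid, 0) < version for peer in peer_logs):
            degraded.add(oid)
    return degraded

def must_wait(oid, primary_log, peer_logs):
    """The wait-for-degraded check: even a read may have to block here."""
    return oid in degraded_objects(primary_log, peer_logs)
```

In the 100-outstanding-IOs scenario above, all 100 objects would land in the degraded set, since only the primary's log has their latest entries.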
C
So,
by
the
time
we're
even
talking
about
this,
we've
gone
to
the
trouble
of
doing
two
entire
message
round
trips
to
the
rest
of
the
acting
set,
doing
two
entire
commit
cycles
and
re-upping
all
of
the
primary
state
just
having
to
redo
a
couple
of
lines
of
code
in
the
op
processing
pipeline
is
small
potatoes.
C: As though it were a brand new message right off the wire. You just have to be very careful about the order in which you do that, because they need to go back into all of those pipeline stages in the same order. It's semantically identical to the order in which we call enqueue_op during an interval change; in classic OSD you'll notice we're very careful about which queues we look at and in what order.
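A minimal sketch of that ordering constraint (structure and names are assumptions, not Crimson's actual pipeline types): draining the pipeline stages from the latest stage back to the earliest re-queues ops in their original submission order, with per-request state dropped so each op re-enters as if it were a fresh message.

```python
from collections import deque

def requeue_on_interval_change(stages, central_queue):
    """stages: list of deques, earliest pipeline stage first.
    Ops furthest along the pipeline were submitted earliest, so the
    latest stage must be drained first to preserve submission order."""
    for stage in reversed(stages):
        while stage:
            op = stage.popleft()
            op.pop("state", None)      # drop the op's in-memory state
            central_queue.append(op)   # back to the central queue, like
                                       # a brand new message off the wire
```

This mirrors the classic-OSD care about "which queues we look at and in what order": getting the drain order wrong would reorder ops that the client expects to stay ordered.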
C: But I mean, if you look at the implementation, that's literally what it is. It uses the Seastar lock, which internally is just a linked list of tasks.
C: Yeah, like I said, if you go look at classic OSD you'll notice we're very careful about the order in which we call enqueue_op on all the various wait queues. You're doing the same thing.
D: Yeah, I'm still developing the tree and doing the balance test case. For the initial error I made, I don't have a Crimson cluster, but I saw that someone has already applied a PR to solve the problem, so I will try it and see if it is fixed.
C: Yep. I've been working on finishing up the PR for basic garbage collection in SeaStore. So at this point it should be able to run indefinitely doing a random IO workload without running out of space. That's good, because it will clean up the used space behind itself, right, so it should use about the amount of space that it's actually supposed to use, and not just, sort of, all of it.
C: It has two tunables. One is a target free space to maintain: if it gets to that amount of free space, say 20% or something, it will garbage collect aggressively to maintain that amount of space, or prevent IOs from happening. The other parameter is a ratio of live to unavailable space.
C: So if we have two megabytes of segments that are unavailable because we've written to them, but of those two megabytes only one megabyte is actually live, and the threshold is set to 50%, it'll start garbage collecting them even if the disk is relatively empty. This way we won't get to 20% available space and suddenly garbage collect the entire disk.
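The two tunables can be sketched as a simple predicate (the names and default values here are assumptions for illustration, not SeaStore's actual option names), using the 2 MB written, 1 MB live, 50% threshold example:

```python
def should_gc(free_bytes, total_bytes, live_bytes, written_bytes,
              free_target=0.20, live_ratio_threshold=0.50):
    """Decide whether to start garbage collection (illustrative policy)."""
    # Trigger 1: free space has fallen to the target floor (e.g. 20%),
    # so collect aggressively rather than let IOs be blocked.
    if free_bytes / total_bytes <= free_target:
        return True
    # Trigger 2: too little of the written (unavailable) space is live,
    # e.g. 2 MB written but only 1 MB live against a 50% threshold,
    # so start reclaiming even when the disk is relatively empty.
    if written_bytes and live_bytes / written_bytes <= live_ratio_threshold:
        return True
    return False
```

The second trigger is what spreads the cleaning out over time, instead of deferring all of it until the free-space floor is hit at once.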
C: No, that's actually an entirely different feature. Right now it just chooses whichever one has the most data blocks; we can be smarter about that later. No, this is like: when an IO comes in, we have to decide how much garbage collection work to do, right? And if we only... let's say it's a four-terabyte disk, but only 100 gigs are in use.
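One way to picture that per-IO decision (the formula is purely illustrative, not SeaStore's actual policy): scale the garbage-collection work charged to each incoming IO with device utilization, so a mostly-empty four-terabyte disk does almost no cleaning per write.

```python
def gc_work_per_io(used_bytes, total_bytes, max_units_per_io=4):
    """Units of GC work to do alongside one IO, proportional to how
    full the device is (illustrative pacing, invented constants)."""
    utilization = used_bytes / total_bytes
    return round(max_units_per_io * utilization)

TB = 1 << 40
GB = 1 << 30
```

With 100 GB in use on a 4 TB disk, utilization is about 2.4%, so the pacing rounds down to no GC work per IO; a half-full disk pays for cleaning as it writes.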
B: Last week I was debugging the interruptible future, and I think I'll submit the PR as soon as these parts are delivered.
C: Thank you. How is the review on part one of the dirty extent write going now?