From YouTube: Ceph Crimson 2021-03-31
A
Last week I've been occupied by the migration work.
B
Sometimes I can finish the testing, and sometimes there's still a segmentation fault, but I can't find any logs on what will trigger the segmentation fault. Sometimes it happens repeatedly; sometimes it will not happen even if I run the test for over one hour. So I'm still trying to figure out how to trigger it, under what conditions I can trigger it, and to trace it. Another thing is that I read Sam's segment cleaner code and journal code and raised a little issue, and I will fix it tomorrow.
B
So sometimes it can finish, maybe repeatedly many times with no error, but sometimes the segmentation fault happens repeatedly. And when I set the offset to be larger, for example about 50 gigabytes, fio cannot run, so I'm not sure if it is a fio issue or something else.
C
I mean, it's pretty clearly a timing bug; it's real, I just don't have a way to reproduce it right now. As for the other thing, I'd ask you to identify why it's happening.
B
Yes, the offset: I created many jobs, and each one set a different offset. If I set the offset to a bigger size, like 50 gigabytes, then fio stopped there. But nothing ever happened on the nbd side; fio just stopped there. So.
D
Hello. Last week, or the past week, I picked back up the document you mentioned last week, Kefu, regarding documenting recovery. I'm still working on that. There are a few questions that Sam helped me answer, so hopefully I'll be done with it by my next meeting. And also I synced up with the...
D
The discussion on refactoring with either unique_ptr or foreign_ptr for the connection: what was decided for now is that I'll continue doing the refactor with unique_ptr, but keep in mind that we might need to change that to foreign_ptr later. So the way I'm implementing it now should be pretty easy to transition to foreign_ptr, maybe with some sort of a wrapper, but it definitely wouldn't require doing again all the changes that I'm doing now.
F
Hi everybody. I was working on introducing to the classic OSD the changes I made to the scrubber state machine, so that it would match what I did in crimson.
C
Yep, I'm working on the logic for extents still. I came up with a scheme for embedding all of the relevant allocation information into the LBA layer and avoiding the extent map entirely, so I'm working on implementing that. It's a little bit more complicated, but it eliminates an entire second mapping, so it's definitely worth it.
C
So originally the idea was that Chunmei created an extent map implementation that we were going to use as a mapping from object addresses to logical addresses within...
C
At least, that's what I'm working on; we'll see how it works out. I may still need it. The main sticking point was figuring out how to handle clones, but I think I'm going to add what is essentially an indirection into the LBA map itself. That is, for a logical address inside of the LBA btree manager itself, the entry, instead of pointing at a physical address, might point at another logical address. That eliminates the need for a second mapping.
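The indirection described above can be sketched with a toy model. Everything here (the class, the address values, the tag names) is hypothetical for illustration; it is not Crimson's actual LBA code:

```python
# Toy model of an LBA mapping whose entries point either at a physical
# address or at another logical address (an indirection). Illustrative
# only; names and layout are assumptions, not Crimson's data structures.

PHYSICAL = "physical"
INDIRECT = "indirect"

class LBAMap:
    def __init__(self):
        self._map = {}  # laddr -> (kind, target)

    def map_physical(self, laddr, paddr):
        self._map[laddr] = (PHYSICAL, paddr)

    def map_indirect(self, laddr, other_laddr):
        # e.g. a clone's extent pointing at the original's logical address
        self._map[laddr] = (INDIRECT, other_laddr)

    def resolve(self, laddr):
        """Chase indirect entries until a physical address is reached."""
        kind, target = self._map[laddr]
        while kind == INDIRECT:
            kind, target = self._map[target]
        return target

m = LBAMap()
m.map_physical(0x1000, "paddr_A")
m.map_indirect(0x2000, 0x1000)  # clone shares the original's data
print(m.resolve(0x2000))        # paddr_A
```

Resolution just follows indirect entries until it lands on a physical mapping, which is how a clone could share the original extent's data without maintaining a second mapping structure.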
G
Last week I implemented the get and set address methods in SeaStore. Right now I'm trying to implement the read_meta and write_meta methods in SeaStore, and I'm also looking into Marco's problem, which might be caused by the interruptible future and the seastar thread. That's all for me.
A
Just a little more background: Marco is observing some memory leak, and he tried to bisect to find the offending commit, and he found that it was the interruptible future that introduced these regressions. So that's being looked into, and...
E
For the onode tree parts: last week I fixed minor bugs and worked to implement the onode erase feature. I think I have figured out a way to manage the tree-level invariants across recursive merge and split.
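As an illustration of what "tree-level invariants across recursive merge and split" means, here is a toy B-tree invariant checker. The node layout and the CAP value are assumptions made for the sketch, not the onode-tree implementation:

```python
# Illustrative sketch: the kind of tree-level invariants that recursive
# merge/split must preserve, expressed as a checker over a toy B-tree.
# Not the Crimson onode-tree code; CAP and the Node shape are invented.

CAP = 4  # max entries per node (toy value)

class Node:
    def __init__(self, entries, children=None):
        self.entries = entries
        self.children = children or []  # empty list means leaf

def check_invariants(node, is_root=True, depth=0, leaf_depths=None):
    if leaf_depths is None:
        leaf_depths = set()
    # Fill invariant: non-root nodes stay at least half full.
    lo = 1 if is_root else (CAP + 1) // 2
    assert lo <= len(node.entries) <= CAP, "fill invariant violated"
    if not node.children:
        leaf_depths.add(depth)  # record the level of this leaf
    else:
        # Internal node with k separator entries has k + 1 children.
        assert len(node.children) == len(node.entries) + 1
        for c in node.children:
            check_invariants(c, False, depth + 1, leaf_depths)
    # Balance invariant: all leaves sit at the same depth.
    assert len(leaf_depths) == 1, "tree is unbalanced"
    return True

root = Node([10], [Node([1, 2]), Node([20, 30])])
print(check_invariants(root))  # True
```

After an erase that triggers recursive merges, running a checker like this over the whole tree is one way to confirm the fill and balance invariants survived.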
A
Never
never
never
complete
unit
test
we
have
been
seeing
over
the
over
the
last
couple
months
is
that's
because
the
jenkins,
because
when
we
push
another
chain
to
appear,
they
can
get
about
the
running
task.
Sometimes
it
left
us
with
some
some
some
running
unit
test
and
they
did
not
get
chance
to
be
to
be
finished
by
by
the
parent
process.
So
I
added
a
step
to
abort
to
to
kill
them.
When
the
when
the
checking
job
is
aborted,
it
seems
to
work.
D
Yeah, so you mentioned we have the log of the primary and of the replicas, and we have some sort of an authoritative log that...
C
Read the RADOS paper and we'll discuss it next week. So, one of the things that was originally interesting about Ceph is the process by which we obtain an authoritative log. It's called peering, and that state machine in PeeringState.h is the thing, but the best explanation of it, honestly, is the original paper.
C
Okay, I've got a link for you here. We don't have to wait till next week either, if you want to ping me. What time is it there, anyone, by the way? I actually have no idea.
D
What time zone am I on? Israel, so it's UTC plus... yeah, so it's 8:15 a.m. right now for me.
D
Yeah. Thanks, Ronen. So, the second question.
D
The second question was: in PG log-based recovery, we're going over the missing set, right? And we talked about how, in urgent recovery, we force-recover a specific object. In log-based recovery, my question is: is it possible that, while we're recovering from the missing set, there will be some sort of interruption, or is it guaranteed that we complete recovering all of the objects?
C
So if there's something in the unfound set: I forget what the code does, but I think it just exits out of the log-based recovery process and puts itself in, you know, an unfound state or whatever.
C
I can quickly sketch the peering thing, just to give you some structure. Is anyone else interested? Yeah? Well, maybe. So, at a super-duper high level: do you have a sense of what I mean by authoritative log? That's sort of the...
C
Most recently updated is where we get in trouble. So, as an illustrating example, let's say there's exactly one client and exactly one placement group. The client has submitted 10 IOs on the same object. Because of the way librados works, the writes that it submitted will all occur in order, and each will increment that object's version by one.
C
What are the valid versions of the authoritative log that could result? Or, what would make an authoritative log wrong?
D
Well, if none of them committed... then, if we cancel all the IOs, then none of them... oh.
C
The OSD, yes. But when the OSD comes back up, the very second it comes up, the first read it serves might not be from that client; it might be from a different client. So what possible states of the PG are valid? Or, to put it another way, what do we need to do to not break the rules?
D
The rules: we can serve them again, and then it's guaranteed, you know, we will make sure that those writes are committed, right?
C
But we don't. So, for one thing, they may not have committed; we turned the OSD off, we don't know. So from our point of view, from the client's point of view setting up this scenario, it's possible that all 10 of the IOs never got to the OSD (they're in a router somewhere or a switch somewhere in transit, and the packets themselves literally never got there), or it's possible that all ten committed and the replies didn't get back to the client, or any intermediate state between those two.
C
So the answer is that, as long as the client never saw any of the responses, it would be valid for the log to contain none of those writes; it would be valid for it to contain all of those writes; it would be valid for it to contain the first five, the first six, or the first seven. But it would not be valid for it to have writes two and six and none of the others, because they're submitted in order. Right? Yep.
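In other words, the valid authoritative logs in this example are exactly the prefixes of the submitted write sequence. A minimal sketch of that check (illustrative only, not Ceph code):

```python
# Sketch of the ordering argument: with one client submitting writes in
# order, and the client having seen no responses, a log is valid iff it
# is a prefix of the submitted write sequence. Illustrative, not Ceph.

def is_valid_log(submitted, log):
    """A log is valid iff it is a prefix of the submitted writes."""
    return log == submitted[:len(log)]

writes = list(range(1, 11))              # ten writes, versions 1..10
print(is_valid_log(writes, []))          # True: none committed
print(is_valid_log(writes, writes))      # True: all committed
print(is_valid_log(writes, writes[:7]))  # True: first seven
print(is_valid_log(writes, [2, 6]))      # False: holes break ordering
```

A log with writes two and six but not the ones between them fails the prefix test, which is exactly the invalid state described above.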
C
So, in general, the way we do this is... let's make a few simplifying assumptions about the way OSD maps work. For one thing, let's assume we have all of the OSD maps back to the beginning of time.
C
So let's say we receive a new OSD map that says I, osd.10, am now the primary for PG 1.12. What I'm going to do is go back through the set of OSD maps, back to the beginning of time, and find every acting-set OSD member that has ever been in this set. I am then going to query all of them. As long as I receive a log back from at least one acting-set member of every interval, that is, every contiguous sequence of epochs where the acting set was the same...
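The interval-coverage condition being described can be sketched in miniature. Acting sets are modeled here as plain sets and the osdmap history as a list, which is a deliberate simplification of the real structures:

```python
# Toy sketch of the peering coverage condition: from the osdmap history,
# compute the past intervals (maximal runs of epochs with the same acting
# set), then check that a log arrived from at least one member of every
# interval. Illustrative only; not Ceph's actual past-intervals code.

def past_intervals(acting_by_epoch):
    """acting_by_epoch: one acting set per epoch, oldest first."""
    intervals = []
    for acting in acting_by_epoch:
        if not intervals or intervals[-1] != acting:
            intervals.append(acting)  # acting set changed: new interval
    return intervals

def can_peer(acting_by_epoch, responders):
    """True iff some member of every interval's acting set replied."""
    return all(set(acting) & responders
               for acting in past_intervals(acting_by_epoch))

history = [
    {0, 1, 2},   # epochs where the acting set was {0, 1, 2}
    {0, 1, 2},
    {0, 3, 2},   # osd.1 replaced by osd.3: a new interval begins
    {0, 3, 2},
]
print(can_peer(history, {1, 3}))  # True: both intervals covered
print(can_peer(history, {1}))     # False: no log from the second interval
```

If some interval has no responder, peering cannot rule out that the most recent writes happened during that interval, so it cannot safely pick an authoritative log.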