From YouTube: 2020-02-11 :: Ceph Crimson Meeting
A
On the classic OSDMap handling path, I spotted that we are actually caching the map twice: the OSDMap itself and the bufferlist containing the raw OSDMap data. What is the reason for having two levels of caches, even though there's the extra cache level in ObjectStore?
D
I went through, like, the whole thing, but apparently I was muted. I'm pretty sure the bufferlists are incrementals. You can check me on that, but the answer is that if OSD 10 is on map 120 and it gets a message from OSD 5, who's on map 115, you can just go back in the incremental bufferlists and send those 5 encoded OSDMap incrementals instead of sending the full maps. I'm like 80% sure that's what it is, likely.
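A minimal sketch of that catch-up scheme, assuming a per-epoch cache of encoded incrementals; all names and structures here are illustrative, not the actual Ceph OSD code:

```cpp
// Hypothetical sketch: keep the encoded OSDMap incrementals per epoch,
// so a peer on an older epoch (e.g. 115) can be brought up to ours
// (e.g. 120) by sending five encoded incrementals instead of full maps.
#include <cstdint>
#include <map>
#include <vector>

using epoch_t = uint32_t;
using encoded_map_t = std::vector<uint8_t>;  // stands in for a bufferlist

struct IncrementalCache {
  std::map<epoch_t, encoded_map_t> incrementals;  // epoch -> encoded incremental

  // Collect the incrementals needed to move a peer from peer_epoch up
  // to our_epoch; a short result means falling back to a full map.
  std::vector<encoded_map_t> catch_up(epoch_t peer_epoch,
                                      epoch_t our_epoch) const {
    std::vector<encoded_map_t> out;
    for (epoch_t e = peer_epoch + 1; e <= our_epoch; ++e) {
      auto it = incrementals.find(e);
      if (it == incrementals.end())
        break;  // gap in the cache: caller sends a full map instead
      out.push_back(it->second);
    }
    return out;
  }
};
```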
D
Makes sense, but obviously nothing below that layer actually works. So if you wanted to test your implementation, we'd have to talk about what it's supposed to do and then probably build an in-memory mock for it. It's up to you, though. It'll get rapidly more complete over the next couple of weeks, I think.
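As a rough illustration of what such an in-memory mock could look like, here is a sketch behind a hypothetical object-store-like interface (not Crimson's actual store API):

```cpp
// Hypothetical in-memory mock of an object-store-like interface, in the
// spirit of the suggestion above; it only shows the testing shape.
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct MockStore {
  using data_t = std::vector<uint8_t>;
  std::map<std::string, data_t> objects;  // object name -> contents

  void write(const std::string& oid, uint64_t off, const data_t& data) {
    auto& obj = objects[oid];
    if (obj.size() < off + data.size())
      obj.resize(off + data.size());
    std::copy(data.begin(), data.end(), obj.begin() + off);
  }

  data_t read(const std::string& oid, uint64_t off, uint64_t len) const {
    auto it = objects.find(oid);
    if (it == objects.end() || off >= it->second.size())
      return {};
    uint64_t end = std::min<uint64_t>(off + len, it->second.size());
    return data_t(it->second.begin() + off, it->second.begin() + end);
  }
};
```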
D
For one of the cases, it's because the interface that we created for the object store in Crimson doesn't have the flush mechanism that the classic one has, which is fine. So instead I just sent an empty transaction with the flush callback on it, which is semantically identical if you think about it, right: because you can't complete a transaction before the other transactions on the same sequencer complete, sending an empty transaction on the same sequencer with the callback you want is the same thing as a flush.
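A small sketch of that equivalence, assuming a sequencer that completes transactions strictly in submission order; the names are hypothetical rather than the real interface:

```cpp
// On an in-order sequencer, an empty transaction with a completion
// callback acts as a flush: the callback cannot fire before everything
// queued earlier has completed.
#include <deque>
#include <functional>
#include <utility>

struct Transaction {
  std::function<void()> on_complete;  // empty ops list == no-op transaction
};

struct Sequencer {
  std::deque<Transaction> queue;

  void submit(Transaction t) { queue.push_back(std::move(t)); }

  // The backend completes transactions in the order they were submitted.
  void complete_next() {
    if (queue.empty()) return;
    Transaction t = std::move(queue.front());
    queue.pop_front();
    if (t.on_complete) t.on_complete();
  }

  // "Flush": an empty transaction whose callback runs only after every
  // previously submitted transaction on this sequencer has completed.
  void flush(std::function<void()> done) {
    submit(Transaction{std::move(done)});
  }
};
```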
F
So the problem with the empty transaction is that, on the sequencer, the one that came in later was sometimes the first to come out of the queue. So I added a mutex there to guarantee first in, first out, and then it works. But whenever I rebase it to the latest Crimson master branch, I find the buffer free has some problem.
A
I was focused almost exclusively on the classic OSD, working on the memory corruption bug. My theory is that we have a synchronization issue with the OSD's osdmap field: it's a shared pointer accessed by multiple threads, very likely without any synchronization. I made a fix and am going through Teuthology testing to verify whether this was the root cause or not.
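To make that class of bug concrete, here is a minimal sketch of guarding a shared_ptr field that multiple threads read and update; it is illustrative only, not the actual OSD code or fix:

```cpp
// A shared_ptr field read and written by multiple threads is a data
// race unless access is synchronized.
#include <memory>
#include <mutex>

struct OSDMapStub { /* decoded map state would live here */ };

class MapHolder {
  std::shared_ptr<OSDMapStub> map_;
  std::mutex lock_;

public:
  std::shared_ptr<OSDMapStub> get() {
    std::lock_guard<std::mutex> l(lock_);
    return map_;  // take a reference under the lock, then use it freely
  }

  void set(std::shared_ptr<OSDMapStub> m) {
    std::lock_guard<std::mutex> l(lock_);
    map_ = std::move(m);
  }
};
```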
D
Okay. If you want to create a Google Doc, you can put your questions in there, and then we can do the sort of chat thing on the side. That way other people can see the question-and-answer process, and we can kind of turn it into a bit of documentation as we go. It doesn't have to be a lot; it's not like we're trying to write a lot down. It's just a way to make it easy for people to track the progress.
D
Just one suggestion: you might want to simply try commenting out the part where you add the PG log entries to the message and see what that does to the overall time.
E
I also suggest you try rerunning the perf test, because recently the perf test has not been quite stable. For example, if we add some non-functional code in the path which is not critical to the I/O, that could also fail the perf test. So normally I just rerun the perf test and see if it makes a difference.
C
Just to try to see what we can get: whether we can get better performance by removing another copy in the native stack, because currently our native stack performance numbers include copies in both read and write to the network layer. So I'm trying to see how much better we can get if we disable those two copies.
D
There are two problems with that. One is that we obviously need to commit all of the writes associated with a transaction atomically, so during a replay we can't replay half a transaction; we have to replay the whole thing. But the second part is that we interleave reads and writes. Imagine updating a whole bunch of different omap values in an RGW bucket index, which is a real thing that really happens: that might involve a bunch of temporally separated reads across a bunch of B-tree pages.
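A toy sketch of the atomicity half of that, under the assumption that mutations are staged and applied together; the in-memory types are stand-ins, not the real store:

```cpp
// Every mutation in a transaction is staged and applied together, so
// replay can only ever observe a whole transaction, never half of one.
// In the real system the reads feeding these writes may be temporally
// separated B-tree page fetches.
#include <map>
#include <string>
#include <utility>
#include <vector>

using Store = std::map<std::string, std::string>;

struct Transaction {
  std::vector<std::pair<std::string, std::string>> staged;

  void set(std::string k, std::string v) {
    staged.emplace_back(std::move(k), std::move(v));
  }
};

// Apply everything or nothing: a record that was only partially written
// is discarded on replay rather than half-applied.
void commit(Store& store, Transaction&& t) {
  for (auto& [k, v] : t.staged)
    store[k] = std::move(v);
}
```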
D
It's a classic concurrency problem, and it's true in Crimson even though we don't have multiple threads. The reason is that we need to send the disk a lot of reads in parallel, and in the common case you don't really have reads touching the same B-tree pages, because the OSD already prevents concurrent writes on the same object. So it won't normally happen. But if you caused enough merges or splits propagating up the tree, hypothetically it could, because everyone has the same root, right.
D
I'm not entirely sure that actually avoids the problem, because you may already have performed the read at a leaf; you don't necessarily know which insertions you're doing at what stage. There are many ways to slice this problem, but this is the fundamental problem.
D
Well, it's not a lock. It means that the pending transaction we're doing right now has its own copy of this physical block. When we go to commit the transaction, we'll do a bunch of things. One of them is: we will actually write the block, or rather the delta for the block, out to disk, right, but we'll also insert the block back into the cache. And while it's just ours, it'll be in state pending, which means that if we try to mutate this block again in the same transaction, we won't do a mutation; we'll just directly change it, because we already have a copy. When we put it into the cache, it'll be in a new state.
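A compact sketch of that copy-on-first-write "pending" idea; the state names and helper are hypothetical, not Crimson's actual cache code:

```cpp
// First mutation in a transaction copies the block and marks it
// PENDING; later mutations in the same transaction edit that copy
// directly. Commit would write the block (or its delta) out and
// re-insert it into the cache.
#include <cstdint>
#include <memory>
#include <vector>

enum class extent_state_t {
  CLEAN,    // matches what is on disk
  PENDING,  // private copy owned by an uncommitted transaction
};

struct CachedExtent {
  extent_state_t state = extent_state_t::CLEAN;
  std::vector<uint8_t> data;
};

std::shared_ptr<CachedExtent> get_mutable(std::shared_ptr<CachedExtent> ext) {
  if (ext->state == extent_state_t::PENDING)
    return ext;  // already our private copy; mutate in place
  auto copy = std::make_shared<CachedExtent>(*ext);
  copy->state = extent_state_t::PENDING;
  return copy;
}
```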
D
Not necessarily, no; we might just be writing a delta down. So if I'm changing block address 20,000, I could write a new copy of it out and then change everything that points to it, or I could just write a delta that says: 20,000 now has a different value for whatever B-tree key, right. Which means I have a dirty copy in memory, but the value that it now has in memory will be wrong.
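A sketch of what such a delta could look like, as opposed to rewriting the block and everything that points at it; the layout is hypothetical:

```cpp
// Instead of relocating block 20000 and updating everything that
// points at it, journal a small record saying which bytes changed.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Delta {
  uint64_t block_addr;             // e.g. 20000: the block being amended
  uint32_t offset;                 // where in the block the change lands
  std::vector<uint8_t> new_bytes;  // replacement bytes for that range
};

// Replay mutates the block in place, so no pointer to block_addr moves.
// Assumes the delta fits within the block.
void apply_delta(std::vector<uint8_t>& block, const Delta& d) {
  for (size_t i = 0; i < d.new_bytes.size(); ++i)
    block[d.offset + i] = d.new_bytes[i];
}
```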
D
It won't be correct until the transaction completes and I know what address it's at. And that's actually what I was working through this week.
D
Until last week, I had been assuming that I'd be able to predict where a journal append would show up if I controlled which open transactions existed. But for one thing, that turned out to be shockingly hard to do from a concurrency standpoint, and the other problem is that I was talking to Myoungwon Oh at Samsung, and he was saying that really, for their drives at least, you really want to use the append primitive, but then...
D
So I don't know the address when I'm sending a thing down to disk. The way I'm fixing that is: if you think about it, there are like two basic kinds of things I could be sending to disk. There are things that come from above the transaction manager layer; those do not point to actual physical addresses, so those are fine. I don't care where they go, because nothing else can possibly know what their physical address is anyway.
D
But the internal B-tree nodes for the LBA tree are physically addressed, so I care very much what their internal pointers are. So the way this works is: any B-tree node that gets written to a record will have two kinds of values. It will have values that are absolute, which point to physical addresses that are already stable, or it will have values that are relative, which point to a relative offset from that block.
D
In other words, I can't assume where it will land, but if I send down a megabyte of data, the pointers within it can be relative offsets to any bytes within it, because that megabyte of data stays together. So if I send down ten blocks that are 4K each, and I know that the fourth block refers to the fifth block, I can just let the value in that block be one, or "the next block", or 4K, whatever, right.
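A tiny sketch of those two kinds of values, assuming relative pointers are resolved once the record's final position is known; the representation is hypothetical:

```cpp
#include <cstdint>

struct paddr_t {
  bool relative;   // true: offset from the referring block's own address
  uint64_t value;  // absolute address, or the relative offset

  // Once the record lands and the referring block's address is known,
  // relative values can be fixed up into absolute ones.
  uint64_t resolve(uint64_t block_base) const {
    return relative ? block_base + value : value;
  }
};

// Example: if ten 4K blocks go down in one record and the fourth block
// refers to the fifth, its pointer is just paddr_t{true, 4096}, no
// matter where the record eventually lands.
```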
D
So that's why the cache keeps getting more complicated: I'm trying to give it enough states to express this. It's kind of what I was saying earlier about why I didn't think the B-tree for the physical layer would look much like the B-tree for the logical layer. The problem space is just... it's a giant headache. You can kind of see it in the design...
D
...because of the way the physical addresses work. Does that make sense to anyone else, or did anyone want me to go over something differently? I will commit this to a document of some sort once I have at least three things that all make sense together, hopefully in the next week or two. That basically means I'm going to write the LBA tree implementation in terms of the cache interface, and that should tell me what the cache has to do; once those two things agree, I think everything else will also agree.
A
We actually have the PRs for the buffer factory: one commit for classic and one commit for Crimson. Maybe, if you want to focus on the performance of things like this from the very ground up, we should consider merging those without waiting, and get the buffer factory modifications committed upstream.
D
I think they're independent, except for the fact that they both involve profiling, and I think we're going to be profiling a lot of stuff over time. I really don't want to block the PG log stuff at all. I think the patch looks about right, so I kind of just want to figure out why it's slow.
D
Oh, I think it is correct. I don't know if the exact numbers are accurate, but it doesn't really shock me that it's significantly slower. There are a lot of opportunities that we're inheriting from the classic OSD to burn some CPU, particularly when we're encoding the PG log entries. So I think it's more just: we need to identify why it's happening.
A
It might be okay, but still, I'm not sure whether these numbers would be meaningful. I mean, if you have plenty of syscalls on a write path, they could have side effects far away from the place where they are issued. That concerns not only the direct cost, but also indirect costs, affecting caching efficiency and things similar to that, I mean.