From YouTube: Ceph Performance Meeting 2020-10-01
A: All right, well, let's get this started, I guess. Okay, so new PRs this week: there was only one that I saw, from Igor. This is more work on his optimization of PG removal. It looks like this is parts one and two of two, which might be an update to his previous one that was also submitted a little while back, so it might be that the old one is being replaced. I guess we'll see. So that was the only new PR that I saw.
A: We do have a couple of updated ones. First is the first part of this from Igor. Also, there was a ceph-volume PR to retrieve device data concurrently that was reviewed by Paul Cuzner.
A: So I guess that's still in the works, still being worked on. There's this PR to allow dynamic levels in RocksDB. It does not actually enable them; it's just changes in our code to make it feasible to change those settings.
A: So I asked for some performance data with those changes in place, using some kind of, you know, RocksDB setting that seems sane for doing the dynamic leveling, and they're working on submitting some results for that PR. So that'll be very interesting, to see what they see with it. It looks like there's more testing on the RGW D3N cache; that's still just kind of chugging along, being worked on. And then there's Majian Peng's PR to reduce bufferlist rebuilds.
A: I think both Igor and Radek have been looking at that. It was at one point failing QA, but it has since been updated, so it probably needs another review and another run through the QA suite. And that was it, as far as I saw, for new and updated PRs; nothing closed this week, as far as I could see. Anything I missed, guys?
A: All right, well then, let's move on. So in the last couple of weeks, Gabi has been working on building a prototype for recycling the keys for PG log entries.
A: This was an idea that we had kind of had for a while, that we were hoping was going to be good, and it turns out that it probably isn't. What we saw when we tried it out was that we end up in a situation where every single PG log update ends up going into the database, and we have significantly higher input records and significantly more work being done during compaction. Our current behavior, while it does introduce a lot of tombstones into the write-ahead log, very effectively prevents us from merging lots of updates into the database. So our current behavior actually is not as bad as it could be.
A: It's just not as good as it could be, either. Gabi, do you want to talk a little bit about your prototype and the kinds of things that you've been thinking?
B: Do you hear me now? (Yes, there you go, better now.) Okay, yeah, the mute thing was a problem. Okay, so the original idea: we used to have ever-incrementing keys. The eversion was the key, and eversions just grow monotonically, so you would never see the same key again. Every transaction had a unique key which we created, and eventually it was deleted.
B: Eventually it was removed and a tombstone was added to the SSTs on disk, and that "eventually" meant we were allowed to flush the memtable and get rid of the write-ahead log. That was the old design.
The new design was saying: you know what, what if we recycle the same keys? So I just created a mapping table from the eversion to an internal ID which gets recycled, and RocksDB never saw the eversion. It just saw my recycled key, so we kept using the same ID again and again. And when an entry was deleted, the code asked to delete it, but it was not deleted: RocksDB was not aware that we deleted this thing. Instead we just freed up the internal ID, and later it was assigned to a new request, which generated a new put.
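A minimal sketch of that recycling scheme as described; the names here (eversion_t, RecycledKeyAllocator) are illustrative stand-ins, not the actual Ceph types:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Toy model: RocksDB only ever sees the small recycled id,
// never the monotonically growing eversion.
using eversion_t = std::pair<uint64_t, uint64_t>;  // (epoch, version), simplified

class RecycledKeyAllocator {
  std::map<eversion_t, uint32_t> live_;  // eversion -> internal id
  std::vector<uint32_t> free_ids_;       // ids freed by trimmed entries
  uint32_t next_id_ = 0;

public:
  // On a new PG log entry: reuse a freed id if one exists, else mint one.
  uint32_t allocate(const eversion_t& ev) {
    uint32_t id;
    if (!free_ids_.empty()) {
      id = free_ids_.back();
      free_ids_.pop_back();
    } else {
      id = next_id_++;
    }
    live_[ev] = id;
    return id;  // this id, not the eversion, becomes the RocksDB key
  }

  // On trim: RocksDB is never told about a delete; the id just returns to
  // the free list, and the next put overwrites the same RocksDB key.
  void release(const eversion_t& ev) {
    auto it = live_.find(ev);
    if (it == live_.end()) return;
    free_ids_.push_back(it->second);
    live_.erase(it);
  }
};

int main() {
  RecycledKeyAllocator alloc;
  uint32_t a = alloc.allocate({1, 1});  // mints id 0
  alloc.release({1, 1});                // id 0 goes back on the free list
  uint32_t b = alloc.allocate({1, 2});  // id 0 is reused
  return (a == b) ? 0 : 1;
}
```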
B: But the side effect we didn't predict was that the write-ahead log is always moving forward, and it keeps the highest version of the memtable for every column family that is using it. So we have multiple column families, and the version, or sequence number, sorry, is incremented every time you flush.
So the write-ahead log knows that you have entries from those memtables and what the highest sequence that exists is. And when the memtable is flushed, then we create a new write-ahead... a new write-ahead log, sorry, and the old one is kept aside until all the column families using it have been flushed. And since we never called flush, the write-ahead log was never able to be released, so it kept growing and growing and growing until it reached some point
where you say: okay, that is crazy. And then there is a mechanism in RocksDB saying, in this kind of scenario, let's find which memtable we should force-flush, and the idea is that it should force-flush the oldest one, because anything which is older than the write-ahead log must be flushed. So that's how we ended up flushing the memtable: even though we never called flush, it was still flushed nevertheless. So the entries were flushed, and they stayed there longer again.
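What Gabi is describing maps to RocksDB's flush-driven WAL retention. A minimal sketch, assuming the stock RocksDB C++ API (error handling trimmed, default column family only), of the explicit flush that lets old WAL files be released:

```cpp
#include <cassert>
#include <rocksdb/db.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal_flush_demo", &db);
  assert(s.ok());

  // Each Put lands in the memtable and the shared write-ahead log.
  s = db->Put(rocksdb::WriteOptions(), "pglog.key", "entry");
  assert(s.ok());

  // A WAL file can only be released once every column family with data in
  // it has flushed. If flush is never triggered, the WAL keeps growing until
  // RocksDB force-flushes the oldest memtable itself, as described above.
  s = db->Flush(rocksdb::FlushOptions());
  assert(s.ok());

  delete db;
  return 0;
}
```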
A: (Do you guys hear me?) Yes, your video is gone now, but you're still here.
B: The reason we put the PG log in RocksDB in the first place was not because we needed it to be a persistent database object; it's because we needed the PG log to go to some kind of write-ahead log, and the entries tend to be very small. Our thinking at the time was that if every object just issued its own write to the log, you'd see a lot of write amplification on SSDs, and spinning drives would just see very slow performance.
B: So we thought we should be able to aggregate the writes with the PG... sorry, with the onode, which we write anyway. We knew that the onode must be written to the write-ahead log. So we said: you know what, since we're writing them anyway, let's piggyback the PG log on the onode write, so it will be essentially for free. I think that was the thinking, and we do get a free lunch by writing into the write-ahead log, but there's no such thing as a free lunch.
B: So, of course, we end up paying for this elsewhere, and the way we pay for it is all the extra work done by RocksDB: when RocksDB is writing stuff into the SST tables, it is actually writing it out to the SSDs and the disk, and eventually creating tombstones.
B: So there's a lot of overhead that we try to avoid, but we still want to use the write-ahead log. So one idea was to create a special behavior in our column family in RocksDB: we could mark that those objects never need to go to disk, and you could flush them by flushing the memtable. We'd free up the write-ahead log, which is what we need. But we wanted RocksDB to enable flushing the memtable without actually writing to disk, which is pretty ugly, because we'd be breaking the semantics.
B: But then we came up with this concept: you know what, if we mark every key when it is first created, we could mark that this key is new; it was just created, and there is no copy on disk. And then, if it is deleted while it's still inside the write-ahead... sorry, inside the memtable, then there's no reason to create the tombstone, because we know it doesn't exist on disk.
B: But we said, you know what, in this scenario we are actually adding an improvement, because we added the mark that this key was generated here, that it's the first occurrence of this key. So when the tombstone is going down, down, down through the levels, eventually it's going to meet that first instance of the key; it's going to remove it, and the tombstone can be removed as well, because we know for certain there is no other version of this key.
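A toy model of that first-occurrence idea; this is illustrative C++, not RocksDB internals, and every name in it is made up for the sketch:

```cpp
#include <cstdio>
#include <map>
#include <string>

// Toy memtable: each entry remembers whether the key's first occurrence
// lives only here. Deleting such a key can simply drop the entry in place,
// with no tombstone, because no on-disk copy can exist.
struct Memtable {
  struct Entry { std::string value; bool first_occurrence; bool tombstone; };
  std::map<std::string, Entry> entries;

  void put(const std::string& k, const std::string& v, bool first) {
    entries[k] = Entry{v, first, false};
  }

  void del(const std::string& k) {
    auto it = entries.find(k);
    if (it != entries.end() && it->second.first_occurrence) {
      entries.erase(it);                    // no on-disk copy: drop in place
    } else {
      entries[k] = Entry{"", false, true};  // may exist on disk: tombstone
    }
  }
};

int main() {
  Memtable mt;
  mt.put("pglog.0001", "entry", /*first=*/true);
  mt.del("pglog.0001");  // erased outright, no tombstone written
  mt.del("pglog.0000");  // unknown history: a tombstone is still required
  std::printf("entries left in memtable: %zu\n", mt.entries.size());  // 1
  return 0;
}
```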
B: So there are two problems with this. The first, immediate problem is: can we convince RocksDB, convince them, to actually take this thing upstream, or would they refuse to take it and we'd have to keep updating our own copy? Because if that's the case, then let's do the simple one and not bother with this.
B: It's a side effect across multiple layers, but yeah. Okay, so that's one direction. Now, a completely different direction that Josh and I discussed about a week ago was saying: you know what, let's forget about RocksDB. Let's try to do the PG log without using RocksDB, and by doing that we don't need to worry about RocksDB, and we think we can actually come up with a better... sorry, a more efficient solution. Now, we still need to see if the solution covers all the bases.
B: So the idea was: why do we actually need a PG log? I mean, the PG log is not essential for correctness.
It is needed for performance. Without a PG log, when an OSD crashes, we need to scan all the PGs and find what version they're at and whether they need to be synchronized from others. So the peering process is very expensive.
B: So then we started saying: can we maybe minimize... sorry, can we make the possibility of failure extremely unlikely? And so there is going to be one scenario... so the solution we suggested, or discussed, does not cover all bases. There is one scenario in which we will still need to do the full disk scan, but we try to make that something unlikely to happen. So the idea goes like this.
B: Is it me, or did you guys disappear? Can you still hear me? (Yes.) Okay, okay. So what we said is that we're going to suggest a solution which is going to work as long as at least one member survived the failure.
B: If all three members fail at the same time, then we are forced to go to the PG-log-free solution, meaning scan everything and hope for the best. But other than that, we think what we suggest is going to give us a very efficient way to store the PG log without compromising anything else. So the idea goes like this: every OSD member is going to maintain a double memory buffer
into which it's going to collect the PG log entries. Once one of the PG log buffers is filled... and the PG log buffer would have to be aligned to sector boundaries, so you're always going to do a full-sector write.
B: So once you fill a sector, a page, you name it, then you can destage it: a single write, zero write amplification. And since you have a double buffer, you now have the second buffer you can use. And you always need to know what version everyone has in their PG log, but that we can already piggyback on every IO between the primary and the replicas, so you always know what the latest version is. So as long as one of them survives, everybody coming back will know what version they have.
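A compact sketch of that double-buffer scheme as described, assuming 4 KiB sectors and a stubbed-out destage; a real implementation would issue O_DIRECT writes to the log device:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

constexpr size_t kSectorSize = 4096;

// Entries accumulate in a sector-aligned buffer; a full buffer is destaged
// with one aligned write (zero write amplification), while the second
// buffer absorbs new entries in the meantime.
class PGLogBuffer {
  alignas(kSectorSize) uint8_t buffers_[2][kSectorSize];
  size_t active_ = 0;  // which buffer is currently filling
  size_t used_ = 0;    // bytes used in the active buffer

  void destage(const uint8_t* sector) {
    // A single full-sector write to the log device would go here.
    (void)sector;
  }

public:
  void append(const void* entry, size_t len) {
    if (used_ + len > kSectorSize) {
      destage(buffers_[active_]);  // one aligned, full-sector write
      active_ ^= 1;                // swap to the other buffer
      used_ = 0;
    }
    std::memcpy(buffers_[active_] + used_, entry, len);
    used_ += len;
  }
};

int main() {
  PGLogBuffer log;
  std::string entry(512, 'x');        // stand-in for an encoded PG log entry
  for (int i = 0; i < 20; ++i)        // fills and swaps the buffers a few times
    log.append(entry.data(), entry.size());
  return 0;
}
```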
B: Sorry, the survivors would know what version they have, and they have all the PG logs. So you say: okay, you guys have to roll back, or redo everything from this point moving forward. So by communicating with the surviving members, you can recover the equivalent of your PG log. But this, of course, won't work if all three of them disappear, because then everything has to be scanned and peered and recovered. So the question is: is this a viable solution?
A: We may... Josh is here. Josh, do you know, is that something telemetry will actually collect, like runtime data like that?
D: Yeah, just going back to the idea in general, I think the concept of needing to do backfill in more cases isn't bad. I mean, the way you're describing it, it's more of an optimization for recovery in a lot of cases. I think the part we have to be careful about is making sure that we do keep the correctness of the recovery, even in those cases with fewer than three failures, too.
D: So for those cases we do need to write some things down to disk for the transaction to be complete. So one thing we could potentially try to do is piggyback on RocksDB's write-ahead log, or spread that write-ahead log, to write down just the information that we need from the PG log to be able to safely determine which replicas are up to date or not, perhaps without...
D: Right, right. So the onode already has this, but if we want to use the onode for that instead, it means...
A: When you guys were talking, did you at all talk about keeping a separate BlueStore write-ahead log instead of trying to modify RocksDB's? I don't think we're going to get upstream RocksDB changes in. Maybe we will, but I think the likelihood is low. So I keep coming back to wondering if we need our own write-ahead log and, basically, you know, short-circuit RocksDB's.
D: I think we definitely want to have a single write-ahead log, to avoid the fragmentation from trying to do multiple writes that are often pretty small. Do you think it's more feasible to get RocksDB to, like, store its write-ahead logs elsewhere than to modify it directly?
A: Well, I think we can just disable RocksDB's write-ahead log; I've seen that documented in various places. So what I keep coming back to is wondering if we should just be implementing our own in BlueStore, and then that also makes it a little easier, not super easy, but a little easier, to do other key-value backends that don't have write-ahead logs.
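For reference, per-write WAL bypass is a documented RocksDB knob (WriteOptions::disableWAL); a minimal sketch of what "just disable RocksDB's write-ahead log" would look like at the API level:

```cpp
#include <cassert>
#include <rocksdb/db.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/no_wal_demo", &db);
  assert(s.ok());

  // Skip RocksDB's write-ahead log for this write. Durability then rests
  // entirely on whatever external log (e.g. a BlueStore-managed WAL) can
  // replay the write after a crash -- exactly the concern raised below.
  rocksdb::WriteOptions wo;
  wo.disableWAL = true;
  s = db->Put(wo, "pglog.key", "pglog.entry");
  assert(s.ok());

  delete db;
  return 0;
}
```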
D: Yeah, so I'd be concerned that if we disabled the write-ahead log entirely for RocksDB, it could get into a state after a crash where we couldn't replay, with the knowledge we had at the BlueStore level, and get back to the same state.
B: Mark, I think one possible issue is that RocksDB knows when it can release the write-ahead log, because when it flushes to the SST tables, then it can find out which write-ahead logs can be freed. And the thing is, we don't know when a memtable has been flushed. I mean, we might be able to find out, but by default we'd miss something.
A: Actually, yeah.
B: RocksDB knows when to flush, because when a memtable is full it knows it's time to flush, or when the write-ahead log gets full, and so on and so forth. But when you take over, then you must monitor the size and capacity of each of the memtables and the write-ahead log, which is part of what RocksDB is doing for you.
A: In the chat window, I just pasted a little bit from their blog talking about MyRocks achieving higher performance by, I think, disabling the automatic flushing behavior of the write-ahead log.
D: The description there makes it sound like they're doing exactly what we were just talking about: essentially using the MySQL binlog as the write-ahead log for everything, and then being able to replay things from that, and therefore not necessarily needing the write-ahead log from RocksDB to always be there.
A: So maybe, while people are looking around at this stuff or reading through things, the other thing I wanted to mention is that in that spreadsheet I linked earlier, you can see that we also looked at the existing code with one PG log entry (basically setting the min/max there to one), with one dup entry (setting the limit there to one), and then with both of those together. And it was very interesting that just setting the PG log entries to one didn't have a very big effect.
A: We saw a bigger change when we switched to a single dup entry being recorded, and then an even bigger change in terms of input records when we did both of those things. So the takeaway that I saw from that was that perhaps there's some compounding effect there. It's also possible that this was just because of one sample; maybe if I took lots of samples we'd see, you know, different behavior, but...
D: Mark, I think that relates to how the dups work. The configuration there is: if you change the PG log size, the dups take up the rest of the space, essentially, so yeah.
D: You have three thousand total entries, and if you reduce the number of log entries, they become dup entries instead. So I would expect, when we reduce the log to one, you'd have the same number of input records, but they would be smaller records.
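A tiny illustration of the budget Josh is describing, assuming a fixed total of 3000 tracked entries (the real Ceph defaults may differ):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
  // A fixed total number of tracked entries per PG, split between PG log
  // entries and dup entries: shrinking the log does not shrink the total,
  // the freed slots just become dups.
  const int total_tracked = 3000;
  for (int log_entries : {3000, 1000, 1}) {
    int dup_entries = total_tracked - log_entries;
    std::printf("log entries: %4d -> dup entries: %4d (total stays %d)\n",
                log_entries, dup_entries, total_tracked);
  }
  return 0;
}
```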
B: (Hey Gabi, sorry, what were you saying?) No, I'm saying what Josh says is perfectly aligned with your finding, because when you set the PG log to one, the results were almost identical to the default behavior. But his explanation is that you're still doing the exact same amount of trimming.
A: So what was kind of interesting about that, right, is that with one PG log entry and one dup entry, we actually see that the input records and output records are not that much higher than when we completely disable the PG log omap, or the PG log entirely. It actually gets pretty close. It's not quite there, but it's a lot closer than our current default behavior.
A: Still slower, still a lot of overhead in the PG log, and that's not surprising; it's complicated, it does a lot of work. So our upper boundary on this is, like you said when you were looking at it, Gabi, I think about 40; it's a little less than that, but it's pretty close. But given our current code and other bottlenecks in the OSD, we can see fairly significant gains by improving this.
A: I think my takeaway from that is that this is still a valid thing to do. We can both reduce the amount of wear on the drive by improving this, and we can improve IOPS. And, just guessing based on previous testing, especially if we can improve the way the PG log itself works, we can reduce the amount of CPU we're taking per op as well.
B: So we considered three different solutions today. One of them is changing RocksDB, adding this new flag. The second one is creating something completely separate just for the PG log, a new solution outside RocksDB. And the third one is what you suggested: to either use MyRocks instead of RocksDB, or maybe emulate MyRocks' behavior by taking control over the write-ahead log and the flushing.
B: And it seems interesting, but I've just glanced through it; it might be more similar to what we need. But the other solution is, of course, what you said: we just take control of the write-ahead log, and we control the flushing. RocksDB would never know about the write-ahead log; it's something that we maintain internally. But we need to know that everything needed for RocksDB is recorded in the write-ahead log.
D: I think that does sound like a potentially viable approach. I'm still a bit concerned about RocksDB's internal state, and keeping it consistent with the write-ahead log.
D: Yeah, then it would be good to understand more about what MyRocks is doing there, how its binlog works, exactly.
A: Well, I guess one of the big things, right, that this recycle-log-ID testing showed us was that the fact that we can tombstone lots of entries is actually, you know, really beneficial. So that's kind of part of the current free lunch that we get: we actually are making very heavy use of that in these test results, master default versus recycle-log-ID default.
A: So there's still room for improvement, but it is doing something. I don't know what else... what else, right?
C: Yeah, remind me, when you said completely getting rid of the PG log, did you just, you know, remove that implementation, or did you get rid of the PG log by means of, like, making the PG log entries to track all zero? How did you implement getting rid of the PG log?
D: Yeah, I think that'll help us understand whether it makes sense to even look at this idea of maintaining the WAL outside of RocksDB or not.
A: Caligan, I'm not sure if I'm saying that right, and he may be a very good person to talk to.
C: Well, how do we do... while we think of a bigger implementation, which might take some exploration and time, can we bridge the gap between, like, where we have one PG log entry and one dup, and the default?
C: Can we get to somewhere in the middle? I know we need to track dups for correctness purposes. So first a test, by reducing the dups and also reducing the PG log length to a lower value, which I think doesn't have any correctness issues; just as we said, it is an optimization for recovery purposes. And see how much better we can get with just making those small changes, as we, you know, get to the bigger-picture things.
D: Yeah, I mean, the dups are a correctness thing, so reducing the number there increases the probability that you run into a problem replaying non-idempotent writes. But it is, you know, a trade-off of the probability of getting a bad write versus the extra overhead of doing the tracking.
C: Yeah, I remember in the past we had some issues where, due to not enough dup tracking, we saw out-of-order issues and stuff. But I think the real question is: what is optimal?
D: Yeah, that's something I think we haven't really... it's hard to know for all workloads. What we've chosen today happens to work well in our test environments and in our users' environments. It's hard to know, if we reduced that, say, by half, whether we'd still be okay, or whether we'd see correctness issues in some cases. So we wouldn't know today.
A: Josh, does it scale with the speed that you can process things? Like, do we need a higher amount of dup tracking if we have a higher ingest rate?
B: Josh, what about your idea about piggybacking the PG log information on the onode itself, so we don't have double entries? Because the vast majority of the information in the PG log is already in the onode. So if we just add this extra information, we're going to make the onode bigger, and we could strip it back out when we read it from disk.
B: You take all the extra information which the PG log has that doesn't exist in the onode, and push it inside the onode, but only in the disk copy; you don't have to put it in memory. Then RocksDB would have half as many objects to manage, and the compaction operation would deal with half as many of them. And anyway, the vast majority of the data repeats itself, so even the write would be shorter.
D: Yeah, I think we still want the PG log in memory, in that case, to deal with, like, network failures and peering, so we wouldn't have to scan the disk for transient issues that just caused a little bit of flapping. But for any kind of process restart, when we actually come up again, I think we could get the information about which objects recently changed from the write-ahead log and...
D: No, no, I'm not saying that. I'd say we would want it to be able to handle temporary network blips or other failures where we have some kind of... the OSD was still running, but, like, the acting set changes for some reason.
D: We still want to... we just want to keep the PG log in memory, just to be able to very quickly determine what we need to do, what we need to recover.
D: Adding that extra info to the... I mean, essentially in this case, at that point, we'd still need to do the scan.
B: Do we always have one onode and one PG log entry going together, or can they be generated in a different order, or...?
D: Yeah, there's a lot of overlap. I think the PG log has a little bit of extra information about, like, what type of operation it was, and we could add that, and which extents were modified; we could add that.
D: They're at different levels of abstraction, right: the PG log is being modified at the OSD layer, whereas the onode is being modified at the BlueStore layer today.
B: Yeah, right, I mean, yeah, so this is the problem, because with the PG log you have direct control over when to remove them.
B: Right, we might have enough entries in the write-ahead log; we could actually control the size of the write-ahead log. I think there are some parameters for that; it might be defined just by size. I don't think you can say how many entries, but you can control the size. So if you set the size to something, then you know how many entries you're going to have, or you can say: I need at least that many. And it might be that the write-ahead log size that we have now is already more than enough.
B: Sorry, take it back: we are not affected by the write-ahead log size, but by the memtable size, because when the memtable is being flushed, then the write-ahead log is being removed. And, you know what, there is also an option to say: keep the write-ahead log after you discard it. You can ask for it to be kept for some time, so maybe we could do that, and then the write-ahead log would be the size that we want.
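The options being alluded to appear to be RocksDB's WAL sizing and archival knobs; a minimal sketch using the public Options fields (the values are placeholders, not recommendations):

```cpp
#include <cassert>
#include <rocksdb/db.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Cap the total size of live WAL files; when the cap is hit, RocksDB
  // force-flushes the column family holding the oldest data so old WAL
  // files can be released.
  options.max_total_wal_size = 64ull * 1024 * 1024;  // 64 MiB

  // Keep obsolete WAL files in the archive for a while after they are
  // discarded -- the "ask for it to be kept for some time" option above.
  options.WAL_ttl_seconds = 600;    // retain archived WALs for 10 minutes
  options.WAL_size_limit_MB = 128;  // or until the archive exceeds 128 MiB

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal_sizing_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```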
B: I think what they've done is even an extra optimization on top of what we've done, because we are now trying to get rid of the cost of the PG log entries; I think they tried to get the whole destage API to be more efficient. Maybe they can afford to be a bit behind, which could happen in theory. Just imagine you make yourself a plan and you say: you know what, I know that I'm going to write these 16 objects...
A: We are essentially out of time for this week. Is this a good place to stop? We have things on our homework list to do for next week, I think.
B: We have homework to do: we have to read the MyRocks design papers and see why they did things. As you said, we're going to learn a lot from them.
A: All right, well, maybe that gives us, collectively, some things to work on as we can. Is this a good topic to continue next week, guys?
A: Good, all right! Well then, for those interested, let's do some reading, let's try to learn something in the next week here, and then we'll discuss next week. Sound good?
A: Well then, have a great week, everyone, and we'll meet up again next week. Okay, thanks, see you guys, see ya.