OpenZFS Leadership, 27 Jun 2023

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: June 2023 OpenZFS Leadership Meeting

Description

Agenda: RAIDZ Expansion; Github Action Runners; next-gen Dedup

full notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#

A

All right welcome to the June 2023 open, CFS leadership. Meeting um I have a couple of exciting topics on the agenda today, um so don. Why don't you kick it off.

B

Okay, um yeah I was um I'm gonna. Do some contract work to see the raid Z expansion uh to completion I know there was there's been somewhat of a stall in recent uh in in the recent past and uh I've, um so I just want to give that update. I started on it last week and successfully rebased the current master and have been writing tests.

B

um I've observed a few um issues that I'm that I'm working through right now so hopefully I'll have a pull request. Open, I I would guess this week. If I can hopefully resolve the issues and then we can move forward to iterate it to completion.

A

Awesome um look forward to working with you. If there's anything I can do to help out um walk through the code or anything um happy to help out.

B

That's great yeah I'll walk through I'm, entangled with it by debugging these things until now, I'm sure I'll have a big list of questions uh of how things work or why why things were made the way they were I've been going through the the test, Suite test and annotating. Those and also I'll, probably have some questions as to why we're testing it. That way, those might be for all right. I can't remember his name, but it was v-stack. I. Think that.

C

B

Done those tests.

A

B

So, um but um pretty good set of tests to start with, so that I'm I'm uh I'm grateful that those tests are there, because it's really it's exercising the code so awesome and I think the Z Tab the last uh bit of work. uh Those pull requests that you had mentioned to me Matt were we're all about z-test, so my plan is to take get that on sort of second. After I got all the test. Suite Stuff working makes.

A

Sense, oh, did you want to mention who is Contracting you to do this work.

B

um I think it's okay, I, don't think I'm under NDA. uh It's IX systems, so Alexander Alexander's company, cool.

A

B

Makes sense because it you know it's uh it's more of a hobbyist I guess you know so two net scale would probably pick it up.

A

Cool well thanks a lot for being worked on and thanks tag systems for sponsoring that.

D

Because these phones are the original work as well right that you started.

A

Man, the cbsc foundation sponsored um the book of the work that that or they selected all the work that I did yeah.

D

um The first one back in like 2016 or whatever I think IX was part of that. Then.

B

Okay and I don't know if v-stack is still wait had they put any money in or just maybe time for this.

A

uh I mean they were doing work like doing some development and debugging and stuff. So you know just just contributing the open source collaboration aspect, which is great.

A

Cool well uh I know that work is long overdue and uh I I look forward to see it completed and put into your foot into production. uh You know, I regret that I wasn't able to get it all the way, um all the way all the way done, but uh if I can do anything to help out happy to do so, cool.

E

um Alan, this is uh add just a quick comment on that too. Ix has definitely had an interest in this um for a long, a long time um yeah as Matt said, the the foundation was the primary sponsor of the the bulk of the work from the beginning. I think IX did contribute a small amount early on as well right.

D

In 2016 or 17, when it first started, yeah.

E

I think maybe possibly.

D

Just via the foundation, even.

E

Yeah um but but in any case um you know, IX has certainly had an interest in it for a long time and the foundation also wasn't interested in it. So I'm very happy to see that it's uh it's gonna finally come uh um to a closure.

E

um Also uh I mean I think it sounds like things are, are in good hands, um but if there's anything um from the foundations side as well, that we need um I'm happy to to have that discussion as well.

B

Yeah I don't know if folks would be willing to to do a test run on FreeBSD side. I. Don't currently have a setup to do that, but I can I can probably set that up.

E

Yeah I'll, send you a note uh offline and we can see about um I mean I've, got some interns and things as well. That might be able to um to try out some testing at the same time, and things like that so we'll see, if there's a a good way to collaborate on that.

A

Cool, that's great, well, I, look forward to seeing uh seeing your new PR and closing my very outdated one.

A

um All right uh next item on the agenda. We have from Tino um around testing.

A

F

Sorry for my bit, English of course I'm I'm, a native speaker and my English is not the best, but I will try. Modest reading is really easy for me, but but for telling something in English is a bit difficult. Mm-Hmm so can I uh put my screen on or not into the zoom meeting yeah.

A

F

F

Is it seem yeah? Yes, so we got uh access to to a lot of machine. um This is an overview I'm. Sorry, this is in German currently, um but uh you will you see, I mean we got access to power, PC the 64. little Indian or big onion. That's up to us what what we, what we want to use another Forum, CPUs and and instances at a maximum of 10., and they are already inside some- some different distal distributions.

F

C

F

On on this CPU, because it's good of action, Runner will not run currently on powerpc button.

F

F

Count I will I will just log in into it.

F

So there we are, we have a previously feed over Linux, TBN and so on, and all these instances have two CPUs and and 30 gigabytes of disk space and four gigabyte of RAM, and these.

F

Distribute uh action Runners for us in GitHub, so we can just Define our yaml stuff and it will run on on every machine and after it is run, it will completely reset and then be free for the next one. Of course, this is the list which is currently uh active on my personal GitHub. Instead of as settings, we can just pull this over to opencfs when, when it's really done, when it's ready for for distribution and I can of course show some action which is done just before some minutes.

F

So this is uh an arm system with some beginning of setup, uh not not currently uh finished.

F

Here we have the runner, it's a special Runner, it's uh and another different project from from here Christopher hikes, and he has made with the language, go some.

F

Some nice scripting for working around the the limit of GitHub action Runners to run only on AMD, so this one this is this Moana is. It is a full GitHub extra Runner and it's done on on a lot of different CPUs.

F

um Maybe we have a list? Yes in the releases there, we have all the different systems which are supported.

F

This is the fullest, and you see we can run it, maybe everywhere GitHub action Runner with with this thing, but currently no power PC on FreeBSD. If power PC is only supported on Linux currently because go goes, the limit the go language isn't a port currently on previous the power PC. When this is done. Open ZFS can also be test compiled and test, and all the testings can be done also with GitHub.

F

So I was switch back to this overview currently or this one is idle and I will create just more the younger change to maybe run some thing on arm previously. Developing.

F

And looks, of course, and 10 words Maybe, so this list can be changed easy going. Every time we should test.

F

And then we will see that some of the actions of of the available Runners will run. They are changing from either to active. They are executing, of course, some stuff. We can see the output here.

F

So this Runner is running on arm pbsd 13..

F

Here this one is running on power: PC I'm, a Linux 9.

F

This one is running also on power, PC little Indian, Central, S8 and so on and so on. So we can all these uh 25 27 boxes- I have currently could be in this list and they are, they would run some some building of CFS in testing. So I think this would be really cool when this is finished and we have it Upstream. Yes,.

A

That's great, um the testing that you're doing on these machines is it? Is it just building the building ZFS or is it running like the ZFS test? Suite or z-test is.

F

The full test- feed, of course, okay. So the fourth thing- and we have uh we can do this- we have the okay or from from the director of the open source lab there.

A

Cool, that's great yes, and um so the other question is about kind of resource utilization. You have that list of Runners. It looked like there was like one of each type.

A

Yes um like each uh OS and um CPU architecture is the idea that, um like when you run a build when you open a new PR, it would kick off a build on all of those different ones and then like. If there are two PRS open at the same time, kind of one of them would be running the tests and then, when the machine frees up, there's there's like there's only one like there's one arm64 Alma Linux 8 machine and that's going to run the test for one PR.

A

And then, when that's done, you can run the test for a different PR.

F

Currently, it's limited uh by this, this number 10. We have only 10 instances and uh we have 10 different system tubes before we we take away some distributions here and then we may also have to Debian, 11 or, and then we can maybe have if more power for the real test. I I, don't know how how long the full testing Suite on all these machines we run. I I did it some day with power PC and the power pc pc machines on nvme.

F

They are really fast, so this I think maybe two or three hours for the full testing, but but not more. The arm and the AMD systems are a bit slow, but AMD. Of course you may also take over. This is the thing how how it's done and set up also some some openstack with own AMD machines, and then we can also start another podcaster of new runners.

F

Maybe the word.

F

Can be replaced with this, because then we have always the same. This.

F

A

And um do you know what Hardware is backing these like for the I'm, especially interested in the uh arm? 64.

F

Hardware, stick.

F

A

You see it yeah, okay, so it's from it's an ampere system, cool.

A

um I think this is great uh I think getting testing on these additional architectures um that, like ZFS works on, but uh you know it's not really tested, uh at least not for every PR. um Getting that testing consistently would be wonderful.

F

I need some time and then I will start to start a pull request for the, and then we have it. I think.

F

Are there any questions or some some.

F

Other hints what should be used. Another distribution for, for example,.

D

C

D

Amazon had offered to donate resources to run more of the arm, 64 testing on their graviton instances as well. uh It's just mostly I think a matter of coordinating that with them and and actually hooking up the stuff to to run. Those builds.

A

Yeah um Tio: do you mind stopping your screen sharing so that it's not like going through a couple of different tops, um yeah I told um so we have uh Amazon has given us some resources to the opencfs project um and um and uh Tony. Do you want to talk about what what you're working on uh moving some stuff over to there so far, I think we're just doing the x86 stuff, but uh it definitely could be possible to use that for from we're testing on them as well.

C

uh Yeah yeah I can go and do it a little bit so I've been working on a couple different things with our open, ZFS AWS account. The first thing is moving over our S3 bucket that hosts all of our repositories, all of our RPM repositories from the Livermore National Lab S3 bucket to the open ZFS. One second thing would be moving buildbot to run on the open ZFS around our open, ZFS AWS account, and then the third thing I'm working on is looking into the future.

C

It would be nice to convert all of our build bot stuff into GitHub runners and, more specifically, I'm working with this project called ec2, GitHub Runner, which basically spawns off a GitHub Runner, and then that spawns off an ec2 instance and that allows us to run os's that aren't supporting supported natively by GitHub Runners.

C

So right now, I think they just support as far as Linux goes like Ubuntu, but we need to test it on Fedora and Centos, so I've been kind of in the shadows, testing that I've got it to basically run hello world in an instance, but that's kind of like 80 of the work, because there's just a lot of setup. You have to do so.

C

Hopefully, one day we'll be able to to stop using buildbot and just go purely with runners.

A

It sounds like the um the solution of having a GitHub Runner that then executes. Like then controls, another machine um is maybe similar to the the thing that Tino is showing us to like, create or GitHub runner on this other architecture.

A

Is that I understand that correctly.

C

Yeah, it sounds very similar. The only difference is that the instances that my stuff would be spawning off would be in in ec2 versus real Hardware.

A

Yeah, so maybe yours, it's like doing as a little bit more tight integration with ec2 to like actually create those instances. On Demand versus, like the theme tune, is using it's just like there's already a machine sitting there and it's going to run some job on it.

C

A

Cool well um I mean that all sounds great I think expanding our test coverage is wonderful um and so I look forward to seeing both of these projects. Move Along uh questions for Pony I. Guess uh we got there by talking about um uh arm 64 testing Kony du.

A

Have you thought at all about um using the Amazon uh graviton stuff to for testing.

C

I, haven't thought specifically about that, but there there's nothing. Stopping us I mean we can launch any instance. We want Theory.

A

Cool yeah I mean those are the ones that personally I'm I'm interested in those instances um making sure that those work um I think probably there's a lot of folks at least people who are doing ZFS in the cloud. You know, that's probably the at least second most popular um architecture that they may be running on.

D

Yeah for sure- and you know in general, those Amazon instances are less expensive than the x86 ones for the same performance. So if we can work there, then it's a big win.

A

Yeah yeah I mean it's something where like maybe maybe, if we don't have the money or resources to test everything fully on every OS and platform combo, we can do the full tests on the the cheapest ones, which might be uh AWS graviton and then do you know a more limited number of tests for every PR. On the other platforms.

A

But we can get to cost optimization once we have something working.

A

Cool uh other topics for today's meeting.

D

I guess Brian's not here I was wondering if we had a status update on the 2.2 release or 2. yeah. The next version release of open ZFS last.

G

Week he mentioned to me that he'll be on vacation. This weekend should return next and start the branching process. Okay, he could start before, but was already traveling somewhere makes sense. I can't wait for it to happen. We are stretching it quite a lot in which contexts are related. uh If somebody has time to review a few pages, I have half dozen different optimizations be bigger and smaller right now open.

G

It would be cool if they could be landed before Brian saw most of them and it was okay with it. But a few more eyes would be good because things happen.

A

Yeah sounds good.

D

Does anyone happen to know like kind of similar, to Alexander's point of what the cutoff will be for I'm guessing in order to get into the the next long-term release of Ubuntu? We'd have to have it out in time for the next not long term release if Ubuntu, so they have a release of testing before or whatever it's like. How soon does 2.2 have to be ready to make sure it makes it into the next Ubuntu.

D

Because I have no idea.

A

I also have no idea.

A

Maybe this is a question for the mailing list. Somebody there might have have.

H

The next step onto LTS is April 2024.

D

Right, uh uh that's when it'll be released so it'll.

H

Obviously, yeah I mean I I, don't know I'm not involved in their development process, but I would think this fall probably well.

D

In particular, I wondered if it had to be ready for their 2310, not LTS release, so that you know they've, basically shipped.

H

It yeah I'd, like as a user I'd like to see it having gone in one of the interims yeah.

D

Yeah and so, but it sounds like the Brian's plan will meet that requirement, so we're okay.

H

D

Would think, but if we start getting releases in two weeks then or release candidates in two weeks, then we should be okay.

A

Sounds good? uh What else do folks have to talk about today?.

G

There's uh one more thing see if somebody wouldn't want to command: there's a PR open to increase uh dedup blocks from current 4K to something bigger, and this goes there'll go in discussion to how high could it be and current consensus to try to go to 16k, because I was wishing to go higher I'm trying to push it lower. But if somebody has other ideas, small PR is open, practically changing default and making them tunable.

G

So uh Alan saying so. Your team is also working in that area. I would appreciate your comments if you already have any numbers I like I I have mentioned you there already uh from the point that maybe it doesn't work to touching it right now, while developing this already going in parallel, but it seems like yeah he's interested to push it. So if you could go.

D

Yeah and I can uh I'll see the point of you know the work. We're doing won't be into that too. If it starts shipping in two weeks, um and so did we want to at least have tunables by then but I, don't know, but I agree with you, I, don't know what changing the defaults is. A good idea. At this point, um we've mostly been focused on the the work to make the amplification not as bad and then once that's done.

D

We expect that increasing those indirect and the leaf block sizes will have much less of a penalty, but until we were there, we don't have as many measurements as we'd like, but uh we can do some in the interim to to validate what it looks like, um because we saw we're a different customer uh completely unrelated.

D

It turned out they were using dedup and while doing some benchmarking, we saw uh transactions stalling for multiple seconds, just iterating over the freeze, uh and so just the fact that a lot of blocks were overwritten in this transaction meant that the transaction just sat there spinning at the end- uh and that was quite sub-optimal for uh you know, they're doing fio and then there's just a pause for seconds. Well, no new rights happened because the transaction is Flushing and the next transaction group is already full or it hit the the dirty Max.

D

uh So yeah there's a lot to be done there, but I'm with Alexander that at the moment, I think increasing the the leaf size. There would just make it worse.

D

uh While you do get the advantage of compression and maybe it'll make the on disk size of the table better and maybe even a lot more of it to be cash and RAM I, don't know that the amplification will be worth it, but maybe it's up to the user to make that trade-off.

H

Yeah speaking of dedup, there was some talk about doing an offline dedupe uh based on the you know. Various new kinds of block pointers is that.

D

You mean on brt, yeah.

H

Is that something that's actually likely to be worked on.

D

If there's interest I suppose, uh but right now, what Clara is working on uh is just a different way of doing dedup.

D

That will not suffer as many of the problems kind of taking some inspiration from uh Matt's original log, YouTube idea, but applying some trying to deal with some of the limitations it had of the whole DDT having to be in memory all the time uh and just uh the way the man's version had to rewrite the in when it condensed the DDT had to rewrite the entire thing as one transaction Group, which again could lead to that thing of the whole system, sits there and waits until that transaction is done. Writing out.

D

If your dedup table is 100 gigs. That could take a significant time.

H

Yeah I mean in terms of Interest. Yes, we ran for a while with three dubon and the saving was really quite dramatic. It was like two to one.

A

Yeah I think at a high level like the stuff Allen's talking about is like uh in in implementation, detail, detail of um of the current dedup property and and how that all works um versus the block reference table. Brt thing is like a totally different mechanism.

H

A

I understand um that could be used for like after the fact dedupe, um but I don't know that anybody is working on that right now.

D

You know if it's a science experiment I wanted to play with in the back of my head, especially for something like oh I. Have these two vmdk files that are similar go find the blocks that are the same and and use brt to fix them, uh but with our current work on dedupe, maybe we can get dedup to not suck enough that you don't have to to worry about trying to do offline. Do you do.

A

Yeah that'd be great.

D

I, don't matter how much more do you want me to say about that? Like do we want to look at the design stuff or not yet or no.

G

A

D

You muted again sorry I didn't catch what you said.

G

G

I guess it would be interesting to show people if you have something to show I.

D

Do so let me fire that up quick yeah.

H

That would be interesting.

D

All right, so uh just a quick slideshow here uh talking about some of the problems with d-dube and then how we attempt to address them with a list laundry list of uh design changes to dedupe.

D

So the biggest thing is that we did. You have to do the read before the right. uh So every time we get a write if the DDT is not cached in memory, we have to do a random read to disk or usually multiple through the indirect blocks, and then the leaf block to find out if this particular hash is in uh the DDT and kind of to what Alexander was talking about before. In that pull request. uh Part of the issue is that we store those Leaf blocks in very small records.

D

4K each and even the indirect blocks are also limited to 4K instead of the default 128k for each, but that's because of the right implementation problem which we'll get to in a second. The other problem is the DDT is sorted by the hash. So if you you know, during this transaction, we wrote 100 new blocks. uh The hash means that those are going to be kind of evenly distributed across the entire DDT.

D

So we're going to be writing separate blocks of the DDE for each one, depending on the size of your DDT, but in general the rights are going to be randomly distributed across the whole DDT, not all concatenated or grouped together, like you, would have when you're actually writing a transaction to disk.

D

Then we have the size of the DDT. It's a relatively big data structure. uh Part of that is each entry.

D

So for each hash that we have, the key is 384 bytes, but the body is even bigger because we have the three DVA slots multiplied by four sets, so we actually stored 12 dbas for every dedup entry, uh one for all the dvas for a copies equals one block, one for the copies equals two and another for copies, equals three and then there's also still the deprecated support for ditto blocks, where, if we do block a lot, we'd write an extra set of copies of it, uh and so that means we're having a lot of these very large DVA slots in every entry.

D

When most of the time, almost all blocks that D2 are going to have copies equals one and even if they don't, uh there are ways around that and then the eye of simplification, especially if you're doing a z, Vol with say 8K record size or wall block size. If every time you write 8K for that or a database or whatever you're, also writing a 4K block of the DDT and its indirect blocks, then you can end up with.

H

D

Know more of your iops going to updating the D2 table than actually writing data to the disk and I can really hurt and the same thing can happen with just running at a right bandwidth, uh so part of that the first thing would be implementing a dedup quota.

D

So if the amount of data we're deduping or the size of the YouTube cable is going to be too big for memory or for a dedicated device, if you have a dedupe special device, then we'd be able to say you know once you use up this much of the DDT, we'll stop deduping stuff in order to have enough room.

D

um But for that to make sense, uh you know claro tried to implement that in the past, but the problem is without the zap shrinking work. uh The digital table never got smaller, and so as soon as you ever hit the limit, no matter how many blocks you erased, you'd never get any in the future. So with the zap shrinking work, we're now able to actually shrink the DDT and get room back.

D

So when you delete files, you'll be able to continue and move forward and get dedup again to take that even further I used again an idea from Matt's log dedup concept of the DDT pruning the deduct table is split into two chunks.

D

The list of unique blocks that aren't dedup and the list of duplicated blocks entries in the unique List have to be there so that when a second copy is written, we find it and promote it, because that's going to know where the the existing copy is, which we can't normally find from just the hash, but eventually that table gets really large, as you have blocks that haven't deduped, and so we want to find a way to delete some of the entries from there.

D

So if a Block's been on there and it hasn't deduped in the last hundred thousand transaction groups or whatever criteria we come up with, then we could remove them from this list requires some special processing so that when we free the block- and we see the dedupe flag, we don't assert that it must be in the dedup cable. But we know that if it's not in any of the dedup tables and the d-bit is set, that it must have been on the unique table and got purged, and so it is safe to free it.

D

Even though uh we don't it's not on the list, we know for a fact that it was uh only had a single reference and so looking for ways to basically get rid of the oldest entries, because it's the newest ones that are most likely to dedupe right. If a Block's been on your pool for a year and never dedupt, it's probably not going to do tomorrow, but a block you wrote today has a higher chance of deduping tomorrow, but also to deal with the cash performance effects.

D

uh Another feature we have is deduct preload, so making on demand or automatically at pool import to basically pull the whole DDT Into Cash, uh so that the performance won't be bad, especially over say after a failover to uh in an aha system.

D

You want the system to pull the DDT in quickly because otherwise, in the field we've seen, customers where yeah after we reboot or failover performance write performance is terrible for three days until the DDT Gets faulted In by all the random reads and so being able to just say, go read that sequential label improve things and then to what Matt was talking about we're looking at, can we optimize the size of those blocks and possibly do some defragmentation of it kind of related to the zap shrinking is, if entries or two sparse.

D

Can we rewrite that section of it to make them less sparse and also mean that a whole sub tree of this app will now be contiguous again, instead of being scattered all over the disk, meaning that prefetch will do a better job of getting us more cash hits on dedupe and I mentioned about uh having four sets of three slots for every um Block in the dude table, uh and one of the things we want to do is part of the reason for that.

D

Right now is to deal with the different copies equals, but we've got a concept where, if we have a copies equals one block in the deduct table and later somebody bursts, a new block with copies equals two. Instead of having to store that in a completely separate slot of two dbas, we've upgraded the existing entry by just adding just one second DVA to the existing entry uh downside to this is, if you write a copies, equals one version and then a copies equals two version.

D

When you're free the copies equals two version, we're we're still going to keep two copies in the dedup table. We won't free the space until the block that hash goes all the way to zero references, but I don't see that as being a a common case where people are mixing the copies equals property on a lot of stuff.

A

I mean I. Think realistically you could just say: uh YouTube only works with copies equals. One.

A

What you're saying here totally makes sense, but I think you, you could probably take it even further.

D

So I would save another a couple of bytes in every entry if we just only had to keep one DVA for every entry, we'll have to consider that uh and then also looking at doing better prefetch. For example, I said the the freeze were blocking that system for a long time, uh especially with freeze.

D

We know which blocks it is we have the block finder, so we can uh maybe get more prefetching happening there to pull that part of the DDT in, so that we're not waiting on a synchronous read later uh and trying to look at. Can we do we calculate the check sum of the block early enough in the right path to be able to prefetch that part of the DDT and make a performance difference there as well, for when we're actually thinking it out, especially if it's an asynchronous right, uh the other big thing is.

D

Can we fan out the zaps now right now we have these two giant zaps, but if we did them say on a histogram of the record size, we know that you know a block. That's 4K can't have the same hash as a block. That's one Meg! So can we look in a smaller zap and have more locality? More of all, the blocks of the same size will be in the same zap and maybe improve the performance of reading and writing there.

D

But the biggest part is the log dedupe uh a different version than Matt's original one.

D

uh But the idea is: writing an append, Only log and maintaining those changes in memory until after some uh criteria, like you know, uh it's been too many transaction groups or the logs too big or too old uh to then flush those out to the normal zap data structure, so kind of a hybrid between log space map and Matt's original log dedupe, where we will, uh you know, keep a log so that creating birthing new blocks is just depending to this log object, but eventually we do flush it back to a normal zap so that we have the option of not having to keep the entire DVT in memory at all times, especially if we're having a dedicated fast device to be to store the DDT and we're not going to need to have the whole thing in memory all the time it gives us back.

D

This ability, to you know, fault, or to page out the DVT basically and be able to pull the tune off disk again and once we reach a certain size or age or or whatever. Then we would flush these logs so that we don't have to replay them at import to make sure the time for import or failover doesn't end up getting long, but by grouping a whole bunch of changes to, hopefully a smaller zap with the mix of the sharding and this logging.

D

uh That would be able to amortize the cost of all the indirect blocks that were changing and possibly be able to make the dedupe uh Leaf sizes much bigger so that a they'll have more compression but also we'll be writing a less fragmented uh deduct tables.

D

And then the by using separate logs, it means, if we, you know, have a log of just 4K writes that are being deduped. We can collapse all those right about to the zap and then truncate or remove that log item entirely instead of having logs that go forever we'll be able to, you know, pick a certain type of Rights flush them all to the DDT and truncate. The log, instead of having to you, know, have logs where we have to have a ring buffer kind of thing and we're always scattering around on it.

D

uh So yeah, the main advantage to this is that we don't end up having uh to fit the whole DDT in memory, so we don't have to have a limit on the size. So if you have a you know a two terabyte nvme to store your data table, you can have two terabytes of dedo foundries. If you think that's a good idea, um but it means most of the memory requirement of this is Arc.

D

So it's evictable, uh whereas the original Matt's original log, D2 design, that memory was all kind of overhead and you couldn't ever not have dedupe table in memory. So if you do the table wanted to be too big. You just had no choice but to have that much memory.

D

uh So yeah that's what we have for our design so far, any questions or thoughts.

H

So the the big problem with dedup is I understand it was tended to be memory usage. Do you have an estimate of how much memory you need to get reasonable performance sort of compared to your disk space? Well,.

D

It really depends more on your record size how many different d-doop entries you're going to have for the number of bytes. If all your records are 16, megabytes it'll be a lot less than if.

H

All your records are, let's assume, they're 128k, because.

D

That's right now, I think they're about 176 bytes of RAM for each entry that it keeps in memory uh and something like 200 and something on disk. I forget the exact numbers, uh but by doing some of the things like what Matt talked about, if we limit it to just one DVA in total, uh then we get a lot different uh effects. As far as how much memory it'll take. We can reduce that by a lot, and so we can get the entry smaller.

D

Then it will make uh a bigger difference about how compactly we can do that and uh if we can reduce the amplification cost so that we can allow a larger record size that will allow the deduct table to have uh be compressed again because right now, if you have 4K sectors and 4K blocks, then we just disable compression, um and so, if we can get that suppressed, then we could cache more DDT in the same amount of memory, thanks to compressed Arc.

A

um So uh like each of those improvements individually like they make a lot of sense to me, um I wonder if, if we're really going to do the log stuff, um which I think.

A

Which is then backed by you know, you have like these logs and then each log has associated with it. uh Another.

D

A

Any other, like look-upable, under structure like this app, um but presumably you are rewriting that zap in its entirety like when you flush the log you're, basically going to be like okay well like now we're reading the whole existing zap that corresponds to that log and then we're writing out all the blocks of it. Right.

D

No uh basically, whether it be the log will be appending the things that we're going to change about the zap, so basically increments and decrements of ref counts. uh And then, when that log gets big enough or old enough, we will apply those changes as one transaction group to the zap. So we won't have to rewrite the whole zap.

A

So, in order to apply those changes like if the log has a non-trivial number of Entry entries, then probably every block in the zap will be modified so in practice, you're, you're reading and writing like every block of the zap.

D

Like some of the zaps we've seen, uh some of the detail tables we've seen in in the wild are 100 gigabytes, yeah.

A

You're choosing the size of each Shard, so you're choosing the size of each zap and well.

D

So the uh the shards have to be static, so it'll be uh on some Factor like the the block size, and so the shards won't all be the same size there'll be there's one chart for all blocks up to 4K and another one for all blocks between, like a histogram of the record size. Okay,.

A

D

They won't be equally sized necessarily it'll depend on how your rights are distributed, or your.

A

Records are distributed all right, so in that case, I don't quite understand how you would like you're saying each Shard is arbitrarily large and you're going to rewrite it in one txg.

D

When uh we should just be applying the zap changes as if you just made all those changes to the zap in bulk, so.

A

D

Is you still keep this quite small, so it wouldn't affect.

A

So, are you sharding the logs, based on the hash.

D

Like the record size or whatever so.

A

H

A

H

D

The changes from a log to the zap we can truncate that log, like you, atomically, replay the whole log. If that's like.

A

You're, just delaying you're doing the exact same amount of work by just delaying it. If you're saying.

D

A

Like you're you're logging, it, but then you're doing basically the exact same thing: you're just like spreading it more evenly across the txgs because, like when you're playing the log into the zap, those are going to be random. Accesses.

D

To summary, but uh ideally we, uh like you, said because we're touching so many of the blocks that we'd amortize the cost of rewriting those blocks and of the the indirect blocks, and we would save a lot of iops by not having to rewrite so many yeah.

A

If you read the entire log into RAM and then sort it by by hash values, then you can apply it in optimal order so that you only have to modify every block of the zap exactly once for an arbitrary number of changes um that that seems like. That would be good.

A

But in order to do that, you know you have to be able to limit the size of the zap end of the logs, such that the changes can fit in Ram, um which I think that's doable with some tweaks to the design Dr line, um but so I think maybe I'd like to understand that a little bit better um where I was actually going. Was um you know?

A

If you do that, then the then I wonder if the zap is really the right data structure for on disk representation of this, um like you, because you're talking about making some improvements to the zap which is like you know, probably non-trivial, um and does that mean not it's probably not an optimal data structure for this anyways, especially given like?

A

If you are writing the entire thing at once, like that just say, for example like if you're writing the whole thing at once, then you can like kind of pre-compute exactly how it should go in order to not waste any space right, um as opposed to like, with the kind of existing hash table structure, you're going to have some amount of empty space in every leaf block kind of by Design, um and it sounds like you're saying that the undis structure of the zap is what you're going to be caching in memory and that's what's going to get you the fast um lookups right so like when you're looking up in it you're going to go through the arc.

A

That's going to look in the zap and that's going to give you the value from the lookup, um but we don't need to be able to modify that like on the Fly. We can just have like some data structure. That's on disks we've on disk. We write it once now. We we might be doing a bunch of lookups in it and then um and then later on, we're going to throw that one away and write out a new one.

A

Based on the you know, read the old one, apply the the login memory and then write it all out again, um closer.

D

To your original design of but charted.

A

Yeah, something like that or I mean it doesn't have to necessarily like.

D

I'm not necessary or anything specifically it just that concept of yeah just write the whole thing once maybe append to it and then condense and rewrite it.

A

Yeah I mean I think that sharding is going to be necessary to get.

C

A

Be able to process one whole log in memory at once um and then I I think that the the other kind of design decision is is the on disk structure. Also, the the in-memory structure, like that's kind of what you're proposing, is like it's a zap on disk, we're going to have that be the in-memory structure for reading um and then like there's a different structure for the log uh and then I guess maybe there's a third data structure. That's like an in-memory version of the log that you can do lookups in yeah.

D

Right now that's an AVL tree. That's.

A

D

To what that's, how d-dube is normally and kind of uh our in-memory structure of what's in the log is a smaller version of that, uh because it has fewer fields and so on, but yeah. It makes sense to consider some other options there. As far as making it yeah.

A

D

Optimal in memory yeah.

A

Just because, like the zap is designed to be able to do, you know, look up some modifications one at a time um with you know one disk access per thing uh for the per lookup or modification, but you don't care about you only care about lookups. Basically, because modifications are going to the log and.

D

They're gonna have you're going to generate the ability to batch them. Basically, yeah.

A

Yeah essentially like you're, going to want to generate a whole new on disk structure, all at once from from the contents of the log. um So you know a different data structure might be more compact. This.

D

A

Kind of getting into some of the details, but.

D

um Yeah, it's just I, don't know.

D

We still want the ability to random read into the the D2.

A

Table right, yeah yeah, that's the main thing um assume that you're using the same data structure on disk and.

B

Right in memory.

D

Yeah, uh we'll yeah.

A

I'm happy with that and start applying more.

D

Of that to our prototyping and we'll have something to show off, hopefully, for.

A

D

A

um Other other folks have input on the this dedup stuff, which.

D

A

Think is super cool, even though I do not use dedup, but I find the the design of. It is very interesting.

A

All right uh other last uh topics or mentions for today's meeting.

A

B

A

Then I'll see you all in four weeks, looks like that'll, be July 18th.

D

A

You all right see ya.