From YouTube: Scrub/Resilver Performance by Saso Kiselkov
Okay, so I guess this talk is going to be a little bit about frustrations people have had, and one of the primary reasons why I work on ZFS is that I'm basically trying to solve problems that I'm encountering myself. Just a quick question:
Raise your hand if you're running a ZFS pool of substantial size, let's say 50 terabytes. Okay. Which of you are running scrub or resilver regularly? Okay, about half the hands. And which of you enjoy the experience? You guys are crazy.
Anyway, so we understand that the problem, basically, is that these operations, scrub and resilver, take a really long time, and to many people it would seem that they take a little bit too long to be in any way sensible.
So first I'm going to do a quick design recap for people who are maybe not quite familiar with the object model and the general model of ZFS. As a quick overview: basically, the way we can view ZFS at the object level is that it's a flat database.
The hierarchy of the filesystem itself is built up, as was previously discussed, using an upper layer, the ZPL, but the actual base structure of the file system is flat, and it's based around the notion of constructs that we call objects. You can think of an object as basically a flat plain file: it's just an array of blocks, identified by a number, and we sometimes group these things together.
The key thing to note here is that the DMU doesn't really care about the structure of the objects inside. It only understands enough to be able to find all the objects and read their contents, but it doesn't really care whether there is any linkage between them or any kind of structure to them. It understands only how to get to the objects, how to get to all their parts, and how to checksum them to make sure that they are all consistent.
So everything that you see in ZFS as a user is built as an object: any plain file, any directory, even symlinks, attributes on objects, and zvols. Everything is built out of these objects, each with a specific binary format inside, but basically that's what we're looking at. And so, graphically,
this is about what it looks like. It's a little bit of a simplified view, but you can view these dnodes as essentially your objects, and each of them has a specific data type.
You know that the only sensible way to design such a data structure is as a kind of tree, with lots of indirection in between. This allows us to not have a fixed size for an object, to be able to punch holes in it (have pieces of it that are missing), and to support things like our copy-on-write structure, so we are able to reallocate a file. You don't have to have a file on disk as one contiguous thing.
You can just move bits and pieces of it around, and the only way to do that sensibly is to have it as a tree of indirect blocks that refer further and further down. These indirection levels in ZFS are numbered, and they're numbered from the bottom up.
So your lowest level, level zero, is where the object's contents actually live. This is frequently referred to as user data, the actually useful data in ZFS, and anything above that, higher than level zero, is your metadata.
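To make that fan-out concrete, here is a minimal standalone C sketch; it is not from the talk, and epb (block pointers per indirect block) is an assumed parameter:

    #include <stdint.h>
    #include <stdio.h>

    /* Number of level-0 (user data) blocks covered by one block at "level". */
    static uint64_t
    blocks_covered(uint64_t level, uint64_t epb)
    {
        uint64_t span = 1;

        while (level-- > 0)
            span *= epb;    /* each indirection level fans out by epb */
        return (span);
    }

    int
    main(void)
    {
        /* e.g. a 128K indirect block holding 1024 128-byte block pointers */
        printf("%llu\n", (unsigned long long)blocks_covered(2, 1024));
        return (0);
    }

With epb = 1024, a level-1 block covers 1024 data blocks and a level-2 block covers about a million, which is why the metadata layers stay so small relative to the data they describe.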
So that's the thing: how to locate the actual contents of the objects. The important part here is the disparity in amounts of data, and this little picture does not really capture that all that well. You have to understand that the upper part, which is all the indirections, the levels of the tree that
do not contain your user data (the user data is down here), is just indirections into the lower-level blocks, and you've got to keep in mind that this is a tree that fans out very, very quickly. The upper layers are really a fairly small amount of data volume, usually only about 1% or thereabouts. Most of your blocks on disk are going to be these very bottom-level blocks.
So that's where pretty much all of your space usage lives. The DMU is also the thing that implements the ability to scrub and resilver, and fundamentally scrub and resilver do the same thing. It's pretty much the same algorithm: all they do is go through all the objects, read all the blocks, and check the checksums. Scrub and resilver do not care about the structure inside; they don't know anything about directories, symlinks, hard links, or anything.
So that's why some people who think that ZFS has a sort of built-in consistency checker ("I'll just run scrub, that'll check the file system") are mistaken. It really doesn't. Scrub only ever verifies that the stuff on disk checks out against its checksums. It doesn't care about link counts, and it doesn't care about loops in directories.
A
You
can
create
all
that
if
you
wanted
to
with
the
appropriate
editing
tools
and
so
resilvered,
just
a
variation
of
the
idea
of
scrub
in
that,
if
you
have
a
broken
drive,
we
will
simply
reconstruct
the
data
on
it
and
yeah.
So
that's
that's
the
basic
idea
behind
these
two
things
now.
The
important
thing
note
here
is
that
these
operations
are
performed
in
order
on
a
given
object.
A
So
if
you
have
a
file
that
consists
of
two
terabytes
data
in
sequence,
it'll
just
start
at
the
start
of
the
object
and
just
pretty
much
work
itself
work
its
way
forward,
and
so
the
algorithm
to
summarize
it
in
a
really
quick
way,
is
just
grab
an
object.
Read
through
all
of
its
logical
blocks.
In
sequence
and
logical
I
mean
here,
biological
I
mean
the
the
way
that
the
block
is
represented
inside
of
the
object
and
then
just
grab
the
next
one,
in
repeat,
repeat,
until
you've
passed
over
everything
on
the
file
system.
So you would jump up to a very high level, where you're basically looking at the top level of indirection, and then you recurse down, recurse down, recurse down, and once you reach the very bottom layer you issue a zio read that gets sent right away to the drives, and you keep going. You just keep following this tree structure until you've consumed everything in the object in sequence, issuing I/O as you go.
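As a rough illustration of the walk just described, here is a minimal C sketch; the blkptr type and the helpers issue_scrub_read and read_indirect are hypothetical stand-ins, not the real DMU interfaces:

    /* Hypothetical stand-ins for the real DMU interfaces. */
    typedef struct blkptr blkptr_t;
    extern void issue_scrub_read(const blkptr_t *bp);   /* zio read + checksum */
    extern const blkptr_t *read_indirect(const blkptr_t *bp, int *nptrs);

    /*
     * Depth-first walk of one object's block tree, from the top indirect
     * level down to level 0, visiting logical blocks strictly in sequence.
     */
    static void
    scan_visit(const blkptr_t *bp, int level)
    {
        if (level == 0) {
            issue_scrub_read(bp);   /* leaf: user data, read it right away */
            return;
        }
        int nptrs;
        const blkptr_t *child = read_indirect(bp, &nptrs);
        for (int i = 0; i < nptrs; i++)
            scan_visit(&child[i], level - 1);
    }

The improvement described below leaves this walk alone; what changes is what happens to the level-0 reads it produces.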
Now, this all works very well on an initial layout. This is the sort of scenario where you've just written an object for the first time. Let's say this object has seven blocks in it, written in two batches: blocks one, two, three, and at some point later blocks four, five, six, and seven. Usually what ZFS will do, if you've written chunks at around the same time, is try to write them out in sequence.
A
After
some
time
has
passed,
maybe
there's
some
a
little
bit
of
data
extras
accumulated
on
the
pool,
and
then
you
write
out
the
subsequent
portion
and
so
from
the
point
of
view
of
the
resilvered.
What
it'll
see
is
a
little
issue
reads
for
blocks,
one
through
seven,
pretty
much
in
sequence
and
the
disk.
What
else
the
the
well?
What
they
do
we'll
see
is
they'll
see
blocks
1
2
3
skip
4
5
6
&
7,
so
that
works
reasonably
well.
This
is
on
your
initial
writing.
A
When
you've
just
filled
up
your
filesystem
and
you're
gonna
be
running
scrub
already
silver.
Now
the
problem
is
what
happens
if
you've
rewritten
lots
of
data
if
you've
got
a
gun
in
so
your
typical
database
use
case
database
comes
in
modifies,
a
block
goes
away,
goes
in
again
modifies
another
block
goes
away,
and
these
things
tend
to
basically
spray
out
the
physical
layout
of
the
file
over
the
disk
and
pretty
much
random
order.
A
So
now,
when
you
scrub,
when
you
read
the
object
in
sequence,
you'll
you'll
end
up
seeing
blocks:
1
2,
3,
4,
5,
6,
7,
you'll
issue,
all
the
reason
sequence.
But
what
the
disk
will
actually
see
is
it'll
see:
okay,
I
got
a
read
block
1,
then
it'll
do
a
seek
block,
2,
C
and
so
on,
and
so
forth
and
Paul
is
just
currently
checking
the
layout
there.
It's
completely
good,
but
yeah
pretty
much
it'll
just
jump
around
a
little
bit.
A
Yeah
disks
are
able
to,
to
a
certain
degree
reorder
reads,
but
this
will
basically,
if
you
have
a
large
enough
object,
little
overpower
the
ability
of
a
disk
to
reorder
anything
and
you'll
end
up
you'll
end
up
converting
your
initial
you'll
think
that
you're
doing
sequential
reads
on
the
object.
Because
of
all
this
rewriting
what
you'll
end
up
hitting
the
drive
ass
is
random
reads.
Now
it's
obviously
a
little
bit
of
a
problem
for
performance,
so
the
improvement
here
is
obviously
to
try
and
not
do
that.
A
So
basically,
the
question
is:
how
do
we
get
I
Oh
back
into
order
and
the
way
we've
done
it?
Is
we
split
up
this
scrub
in
Ori
silver
into
two
sections?
We
first
essentially,
we
scan
the
as
much
of
the
data
set
and
these
things
are
not
they're,
not
sequential.
They
are
pretty
much
operating
in
parallel,
but
we
split
up
the
process
into
two
sections.
We
first
scan
through
and
try
and
discover
as
much
of
the
end
user
data,
which
remember,
is
99%
of
your
data.
A
We
try
and
discover
as
much
of
the
locations
the
blocks
on
disk.
Then
we
do
something
to
them
and
then
we
set
them
off
to
be
you
read
or
resilvered
or
anything.
So
in
order
to
do
that,
we
have
introduced
a
per
top-level
vida,
reordering
queue,
because
when
you're
scrubbing
or
riesling
really
don't
care
and
what
sequence
the
blocks
are
being
rebuilt
as
long
as
it's
all
done
by
the
end,
so
we
just
reorder
everything
and
the
the
iOS
are
being
the
IO
that
have
been
generated.
A
Doing
scanning
face
will
get
queued
up,
reorder
aggregated,
so
we
know
which
ones
are
close
together,
which
ones
aren't
and
then
we'll
issue
it
in
a
relatively
sensible
sequence.
So
this
is
pretty
much
how
it
looks
like
after
after
we
introduce
our
changes
here
for
the
improvements
you
come
in
from
the
top
you
again,
the
algorithm
for
scanning
is
unchanged.
So
from
the
point
of
view,
if
the
code
base,
it's
really
not
that
much
of
a
change.
A
So
this
is
a
kind
of
view
that
we
have
from
a
system.
Topology
point
of
view
we
have
a
top-level
Vida
and
each
topple
v-dub
gets
its
own
key,
because
each
top-level
Vida
is
pretty
much
the
thing
that
matters
about
disk
layout
each.
So
we
track
at
perb
top
level
Vida
and
then
reorder
for
it
for
the
particular
disk,
this
sequence.
A
So
how
are
the
queues
implemented?
The
queues
primarily
track
a
keep
track
of
two
things.
They
keep
track
of.
The
individual
reads
to
be
issued,
so
the
CIOs,
although,
as
I
said,
it's
not
the
actual
ciot
structure
and
the
reason
why
we
keep
that
around
is
because
the
CIO
is
contained,
the
block
pointers
or
they
tell
us
about
the
block
lenders
and
the
block
owners
tell
us
that
check
sums
of
the
data
that
we
got
to
check.
The second thing is extents. An extent can actually be a fairly tight collection, but it's a collection of reads that are close together and in sequence, so that we know they represent a good target of opportunity where we could go in and issue a lot of I/O in a good sequence. The sorting queue itself consists principally of three trees, and they all sort of interact with each other. We have a tree that tracks each extent, sorted by the aggregate size of the constituent I/Os.
A
So
we
know
how
large
an
extent
is
and
how
much
of
it
is
actually
filled
with
CIOs.
So
we
know
roughly
how
valuable
it
is,
and
at
the
front
of
the
tree
we
have
sort
of
the
biggest
chunk
is
targets
that
we
want
to
that.
We
might
want
to
work
with.
Then
we
keep
obviously
extensive
tracked
by
address
on
this,
so
we
know
how
to
assign
CIOs
to
them
as
they
come
in
and
the
CI.
Then,
of
course
we
keep
track
of
the
individual
CIO
senator
to
be
issued.
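In structural terms, the queue as described might look roughly like the following C sketch; the type and field names are illustrative guesses rather than the actual dsl_scan.c declarations, and avl_tree_t is the AVL tree type used throughout illumos/ZFS:

    #include <sys/avl.h>
    #include <stdint.h>

    typedef struct scan_io {            /* one queued read, not a full zio_t */
        uint64_t    sio_offset;         /* on-disk address */
        uint64_t    sio_size;
        /* plus enough of the block pointer to re-derive the checksum */
        avl_node_t  sio_addr_node;      /* tree 3: zios by address */
    } scan_io_t;

    typedef struct scan_ext {           /* a run of nearby queued reads */
        uint64_t    se_start;
        uint64_t    se_end;
        uint64_t    se_fill;            /* bytes actually covered by zios */
        avl_node_t  se_size_node;       /* tree 1: by size, weighted by fill */
        avl_node_t  se_addr_node;       /* tree 2: by address */
    } scan_ext_t;

    typedef struct scan_queue {         /* one per top-level vdev */
        avl_tree_t  sq_exts_by_size;    /* juiciest targets at the front */
        avl_tree_t  sq_exts_by_addr;    /* to place incoming zios */
        avl_tree_t  sq_zios_by_addr;    /* the individual reads */
        uint64_t    sq_mem_used;        /* charged against the memory cap */
    } scan_queue_t;

The comparator on the by-size tree is where the fill weighting described below would come in.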
A
So
this
is
sort
of
a
layout
roughly
what
it
looks
like
from
top
to
bottom,
be
extents
by
size
by
address
and
then
finally,
the
CIOs
by
address,
and
as
you
can
see
a
little
bit
over
here,
we
can
see
that
one
of
the
extents
we
call
them
scan
X
second
segment
R.
They
don't
have
to
be
necessarily
completely
filled
with
CIOs.
We
do
allow
for
a
little
bit
of
inter
zio
gap
because
drives
are
pretty
efficient.
A
That's
skipping
that,
but
if
the
gaps
are
too
large,
we
just
consider
that
to
be
a
separate,
cio,
a
separate
extent,
and
so
obviously
we
aggregate
those
into
these
larger
extent
structures,
and
then
we
sort
them
by
size
so
that
we
know
which
ones
are
the
juiciest
ones
and
the
algorithm
for
that
is
a
little
bit
more
complicated
than
just
we
sort
by
size.
We
actually
also
do.
We
do
also
consider
how
well
filled
they
are
so,
for
example,
this
guy
over
here,
even
though
he's
fairly
large,
he
has
a
few
chunks
missing
in
it.
A
So
we
could.
We
do
wake
that
a
little
bit
in
the
algorithm.
So
do
we
know
that,
for
example,
somebody
who's,
a
couple
bites
shorter,
but
completely
filled
up.
We
do
consider
that
one
to
be
more
valuable
than
the
ones
that
are
like
full
holes,
so
say
somebody
so
say
for
existen
is
EAJA
that
comes
in
so
over.
Here
we
got
a
new
CIOs
being
about
two.
That
was
a
request
to
be
queued.
You
can
see
that
it
bridges
the
gap
between
two
extents.
So
obviously,
then
we
got
a
reconstruct.
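A sketch of that insertion rule, reusing the illustrative types from the sketch above; max_gap and the ext_* helpers are hypothetical names for the behavior just described:

    extern scan_ext_t *ext_ending_within(scan_queue_t *, uint64_t, uint64_t);
    extern scan_ext_t *ext_starting_within(scan_queue_t *, uint64_t, uint64_t);
    extern void ext_merge(scan_queue_t *, scan_ext_t *, scan_ext_t *, scan_io_t *);
    extern void ext_extend(scan_queue_t *, scan_ext_t *, scan_io_t *);
    extern void ext_create(scan_queue_t *, scan_io_t *);

    /* Insert a new read: extend, merge, or start a fresh extent. */
    static void
    scan_queue_insert(scan_queue_t *q, scan_io_t *sio, uint64_t max_gap)
    {
        /* Look up neighbors in the by-address extent tree. */
        scan_ext_t *before = ext_ending_within(q, sio->sio_offset, max_gap);
        scan_ext_t *after = ext_starting_within(q,
            sio->sio_offset + sio->sio_size, max_gap);

        if (before != NULL && after != NULL)
            ext_merge(q, before, after, sio);   /* zio bridges the gap */
        else if (before != NULL)
            ext_extend(q, before, sio);         /* grow an extent forward */
        else if (after != NULL)
            ext_extend(q, after, sio);          /* grow an extent backward */
        else
            ext_create(q, sio);                 /* too far from anything */
        /* Either way, re-place it in the by-size tree: size/fill changed. */
    }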
The way it works: many of you will probably recognize that this is a classic sorting problem, in that you can sort as well as your memory allows. As long as you have a good amount of memory, you can sort anything, right? But the problem is that you usually have a lot more data than you have memory to sort it with, and because this is essentially a metadata problem, you're trying to cache metadata in RAM in order to do the sorting.
A
So
we
track
all
the
queues
and
we
do
understand
how
much
memory
they
take
up
and
we
try
to
limit
it
to
a
reasonable
value,
although
its
tunable
by
default.
We
limit
it
to
5%
of
your
physical
memory
and
if
the
queues
bureau,
just
too
large,
will
start
to
issue
your
largest
extents
at
the
front
of
the
queue
and
during
while
issuing
the
zio
reads:
we'll
actually
pause
the
scanner
part,
because
the
scanner
is
essentially
doing
random,
read
I/o
and
it
really
impacts.
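In sketch form, again with hypothetical helpers layered on the structures above, and with the default 5% cap the talk mentions:

    extern uint64_t physmem_bytes(void);
    extern void pause_scanner(void);
    extern void resume_scanner(void);
    extern void issue_extent(scan_queue_t *, scan_ext_t *);

    /* Default cap: 5% of physical memory (tunable). */
    #define SCAN_MEM_CAP    (physmem_bytes() / 20)

    static void
    scan_maybe_issue(scan_queue_t *q)
    {
        if (q->sq_mem_used <= SCAN_MEM_CAP)
            return;             /* keep scanning and queueing */

        pause_scanner();        /* its random reads would fight the issuing */
        while (q->sq_mem_used > SCAN_MEM_CAP) {
            /* Best target first: the front of the by-size tree. */
            scan_ext_t *best = avl_first(&q->sq_exts_by_size);
            issue_extent(q, best);      /* near-sequential zio reads */
        }
        resume_scanner();
    }

(As mentioned later in the Q&A, the real behavior drains somewhat below the cap before resuming, rather than stopping exactly at it.)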
A
Obviously,
the
more
memory
you
have,
the
better
it
works
or
the
more
memory
you
dedicate
to
the
thing,
the
better
it
works
and
I'll
show
that
in
a
couple
of
benchmarks
in
a
moment
now,
one
thing
that
this
really
impacts
on
is
the
ability
to
resume
after
reboot
or
you
have
a
machine
crash,
and
you
want
to
resume
your
resilvered
or
scrub.
So
what
do
you
do?
This
kind
of
design,
that's
kind
of
out
of
order
thing
really
tends
to
mess
with
the
old
way.
A
The
CFS
thought
the
the
progression
records
essentially
on
disk
and
for
scrub.
It's
really
not
that
big
of
a
problem,
but
if
you,
if
you
miss
some
some
transactions
from
your
trim,
it's
from
yuri
silver,
that's
gonna
be
a
bit
of
a
problem,
so
we
do
it
by
essentially
in
certain
periodic
intervals.
We
by
default,
have
it
set
to
about
once
an
hour
once
an
hour,
but
it's
tunable.
We
basically
just
pause
the
scanner.
A
And
then
we
can
update
the
DSL
scan
fisty
structure
on
disk
and
you'll
be
able
to
resume
from
this
point
on
for
work,
and
then
you
can
restart
the
scanner
and
keep
going
filling
up
the
queues
and
issuing
CIOs
out
of
sequence,
CIOs
reordered
and,
as
is
necessary.
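A sketch of that checkpoint cycle; the hourly default and the drain-before-write ordering are from the talk, while the function names are stand-ins:

    extern void drain_all_queues(void);     /* issue every queued zio */
    extern void write_dsl_scan_phys(void);  /* persist cursor in syncing ctx */

    /* Runs roughly once an hour by default (tunable). */
    static void
    scan_checkpoint(void)
    {
        pause_scanner();
        drain_all_queues();
        /*
         * Only now is everything up to the scan cursor truly verified,
         * so the on-disk resume record can safely be advanced.
         */
        write_dsl_scan_phys();
        resume_scanner();
    }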
The point is that scrubs and resilvers typically do take a fair amount of time. So if you lose about an hour's worth of progress on a 12-hour job, it's somewhat of a deal, but it's not that big a deal.
A
If
you
were
to
loose
all
12
hours,
that's
bigoted,
that's
kind
of
a
nasty
thing.
So
that's
why
we
we've
both
in
this
kind
of
mechanism
to
be
able
to
continue
and
in
in
the
end.
It
doesn't
take
that
much
of
a
toll
and
performance.
It's
maybe
5%
of
performance
loss
and
the
important
thing
is:
it
avoids
any
kind
of
disk
format
changes.
So
this
is
complete
a
complete
code
change.
It
doesn't
really
change
anything
on
the
disk,
so
you
could
just
install
it.
A
Try
it
out
if
you
don't
like
it,
roll
back
be
happy
with
it.
So
in
terms
of
numbers,
what
are
we
looking
at?
So
this
is
a
completely
synthetic
test,
the
best
kind
of
or
I
guess,
worst
kind
of
situation.
You
could
look
at
you
could
look
at
So this is basically a 5-drive raidz running on 2.5-inch 10K RPM 300 GB drives, and it's filled up; I basically thrashed it completely with vdbench. It's randomized to the point where the regular stock resilver was running at about 8 megabytes
a second; yeah, it would have taken about a day and a half to complete. With this improvement, the queue size parameter that you see here is basically the value that I limited the queue to: it was allowed to grow to a maximum of 262 megabytes.
The reason that's kind of a weird number is that the machine I was testing on had 256 gigs of RAM, so I just set it to one percent, or actually 0.1 percent.
Everybody else is probably going to be a little bit closer to this kind of number. This is a general-purpose file server that we have at Nexenta: about 26 terabytes of data running on ten 8-terabyte drives, and the regular resilver takes about two days, or scrub, rather.
Actually, in this case it's easier to test scrub than resilver, because you don't have to pull drives out. But yeah, this test was pretty much done to check what the queue size parameter affects.
A
So
if
you
set
it
on
a
twenty
six,
terabyte
drive,
I
think
the
general
recommendation,
by
the
way,
in
terms
of
memory
to
data
storage,
I,
think
we
recommend
about
one
was
that
0.1%,
so
per
terabyte
of
storage.
We
recommend
about
a
gig
of
ram,
so
this
thing
had
had
about
twenty
six
terabytes
of
data
only
took
at
one
point
three
gigs
of
ram.
It
was
still
three
point
four
times
faster
than
stock.
SSDs: SAS SSDs, six of them in there; I put some data on and randomized it. SSDs generally take a lot better to randomized data being stored on them, but still, if you reorder their reads into sequence and give them a little bit of leeway, you still get a bit of a performance win, even though it's not quite as much. So yeah, that's pretty much it.
If you have any questions and/or rotten tomatoes, feel free to fling them my way. Sure, yep.
Yes. No, this is all protected by a single lock. It's one data structure; the one queue is always locked as one piece, because, as you can see here, on a modification, as soon as you add a zio you've got to modify the middle part and then the top part. So all of that happens in one step, under one lock.
So the question is: how badly trashed was this data, and how un-trashed did we get it? This was completely trashed. This is a 32K block size, and I think I let vdbench run 100% random I/O on this for about a day and a half, so it's completely gonzo. And the critical part here is that the reordering does not happen all in one go.
A
It
basically
tries
to
do
as
good
a
job
it
can
with
reordering
reads
until
it
hits
the
memory
cap
and
then
it'll,
sort
of
okay.
Well,
we
just
gotta
pick
the
best
target
and
we
just
clears
out
one
of
the
the
very
largest
ones
until
it
gets
about
I,
think
about
50
megabytes,
underneath
the
limit
and
then
again
gross
and
basically
keeps
that
around
now.
The
nature
of
this
beast
is
that
such
that,
as
you
progress
further
down
as
you
get
basically
to
the
end
of
the
pool
with
the
scanner.
A
At
that
point,
you
got
to
start
again
grabbing
the
lower
value
target
ones
that
are
shorter.
So
the
final
push
to
get
the
thing
finalized
is
gonna,
be
a
lot
slower
than
the
average
up
until
that
point.
So
this
took
about
two
hours
May,
the
main,
maybe
95%
of
the
data
volume
is
gonna,
be
done
about
in
about
an
hour
and
a
half
and
the
remaining
5%.
Is
there
just
a
really
scramble
around
bits
that
you're
gonna
be
the
elevator
in
through
once
at
the
end?
That's
gonna
be
taken
about
a
half
hour.
A
So
in
the
ends
this
this
is
the
final
average
that
you're
seeing
here
the
the
initial
thing.
When
you
first
run
it,
and
then
it
hits
the
queue
size.
It
starts.
Issuing
you're,
just
gonna,
be
some
all
smiles,
because
Lola
Ron
had
pretty
much
drive,
drive
speed,
but
that's
only
because
you've
just
picked
out
the
largest
contiguous
chunks
that
you
can
find
and
you'll
start
issuing.
Those
first
sure.
A
Yeah,
so
the
question
is
whether
the
iOS
are
issued
out
of
or
inside
of,
sync
in
context,
they're
issued
out
of
sync
in
context,
so
the
waiter
works
normally
GFS,
resilvered
and
scrub.
They
issue
I/o
only
inside
the
syncing
context,
and
they
only
scan
ahead
in
in
sync
in
context.
The
way
this
works
is
once
we
hit
the
memory
cap
or
we
hit
the
the
time
limit
for
a
basically
a
checkpoint.
The reason why it's per top-level vdev is that that's where the ordering of your data becomes important on disk. Even a raidz, which seemingly scrambles things around: if you have contiguous blocks on a raidz vdev, a top-level vdev, it'll still break them apart in sequence onto the individual constituent drives. So that's why it was put there. Changing the queueing strategy at the leaf level, I'm not sure that would be easy to do.
It's pretty much a straight-up out-of-my-sleeve number. It's tunable, so it's not hard-coded. I've tested out a number of values on a number of pools, but basically it's tunable at this point. We might fix it if we determine that there's really no reason for it to be tunable, but I don't really see a need to remove that tunability. I've got to take a question from up there somewhere.
As soon as I fix one particular bug in it. It's pretty much feature-complete now; the only thing left is a little bit of a data corruption bug, where if you interrupt a resilver and then boot into an old machine, it could just cook the pool. But it's really just about finishing up that one thing; the bulk of the algorithm is done, and we could go over it tomorrow in the hackathon.
The question of how it compares to line rate on the drive is pretty much a question of how much memory you can throw at it. If you have a very large data set and you give it only a little bit of memory to implement the queues, then we cannot do as good a job of reordering the zios, because you're just going to have a whole bunch of very large extents with lots of holes in them. So obviously you're still going to get some skips.
A
The
second
question
was
the
second
question
about
again
right.
If
you
still
do
I
owe
to
the
pool,
how
much
does
it
affect
it?
It
affects
it,
obviously,
in
a
fairly
negative
manner,
in
the
same
way
to
affect
it
be
affecting
regular
scrub.
We
issue
the
CIOs
in
exactly
the
same
priority
and
queuing
status
as
regular
scrub.
So
it's
more
of
a
question
of
how
much
does
other
I/o
happening
on
the
pool
effects
scrub
in
general.
It
affects
it
to
the
point
where
you
allow
how
much
you
allow
it.
A
Basically,
in
how
you
how
you
set
up
your
priorities
at
this
point,
scrub
is
fairly
heavily
affected
because
scrub
is
a
low
priority.
Resilvered
are
in
a
much
higher
priority,
but
still
I
mean
it's.
It
is
gonna,
suffer
you're.
Gonna
have
two
concurrent
workloads.
It's
it's
not
gonna,
be
maybe
I,
don't
know,
dropped
out
of
5
percent,
but
you're
gonna
get
a
drop
if
I
don't
know,
maybe
50
percent,
because
other
areas
taking
priority
any
questions.
Sure.
Yes, that's the hope, yeah. So the question is whether this is going to require fewer IOPS than doing it conventionally. Yes, it'll require a lot fewer IOPS. It'll still do the same volume, but it'll do it in a different sequence, preferably completely in sequence; that would be best. Sure.
A
Yeah,
the
question
is
whether
this
can
be
used
to
rebalance
data
I
guess
across
ETF
members
I
was
waiting
for
when
this
year
the
obvious
BP
rewrite
question
is
gonna
come
up?
No,
it
cannot
be
used
because
scrub
does
not.
Maybe
people
imagine
that
scrub
arre
silver
when
it
discovers
a
bad
block,
it'll
just
reallocate
it
somewhere.
It
doesn't
really
cannot
move
data
around.
A
Yes,
but
but
but
keep
in
mind
that,
as
I
said
in
the
initial
part
of
the
talk,
we
don't
look
at
the
structure
of
the
data
inside
so
I,
don't
know
if
what
I'm
looking
at
is
a
bunch
of
block
pointers
or
it's
just
your
Shakespearean
novel,
so
I
cannot
move
it
around
and
if
I
try
to
move
it
around,
I
might
break
checksum
or
I
might
break
block
pointers
that
are
in
a
completely
different
object
pointing
to
this
one.
So
that's
why
we
wouldn't
want
to
do
that.
This
BP
rewrite
essentially
you're
implementing.
It's just done as part of the spa sync, in spa syncing context, and the read threads are all individual threads, one per top-level vdev. They just sit in a cv_wait or a cv_timedwait, and once they have enough data to be kicked off, either to consume the topmost extent or to start elevatoring through everything, that's when they get woken up. So there's no blockage, no obvious blockage, at least not that I'm aware of.
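A sketch of that per-vdev thread loop, assuming scan_queue_t additionally carries a kmutex_t sq_lock and a kcondvar_t sq_cv alongside the fields above; cv_timedwait and ddi_get_lbolt are the real illumos kernel primitives, the rest are stand-ins:

    #include <sys/ksynch.h>
    #include <sys/ddi.h>

    extern int hz;                          /* clock ticks per second */
    extern int enough_work(scan_queue_t *);
    extern void issue_best_extents(scan_queue_t *);

    /* One of these threads runs per top-level vdev. */
    static void
    scan_issue_thread(scan_queue_t *q)
    {
        mutex_enter(&q->sq_lock);
        for (;;) {
            /* Sleep until the scanner has queued up enough work. */
            while (!enough_work(q))
                (void) cv_timedwait(&q->sq_cv, &q->sq_lock,
                    ddi_get_lbolt() + hz);
            issue_best_extents(q);  /* drains while the scanner pauses */
        }
        /* NOTREACHED */
    }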
Yes, so the question is how we do progress reporting. There's a new value. The way the current progress reporting works, if I remember right, is that you get scan stats in your zpool configuration from the kernel. We put another value in there, which is basically the reading progress, and kept the old one as the scanning progress. There's also a modification to the zpool status printout, and you'll get two lines there.
A
It's
one
line,
I,
don't
remember
exactly
which
one
you'll
get
basically
two
printouts
there.
One
of
them
is
scanning
progress
and
the
other
one
is
reading
progress.
So
you
can
easily
end
the
time
estimated
it's
based
on
the
reading
progress.
That's
been
changed
so
that
you
don't
get
like
bogus
values
like
I'm,
remaining
zero
minutes,
0.
Second-
and
it's
still
not
done
so.
A
No
question
is:
where
do
we
store
basically
persistent
state
of
this?
How
do
we?
How
do
we
keep
track
of?
What
do
we
resume?
That's
already
being
done
by
regular
zpool
scanning
and
scrubbing
and
Riesling.
It's
part
of
a
structure
called
DSL
scan
this
t,
+
/,
/,
sync
and
contacts.
You
basically
issue
this.
A
This
kind
of
you
basically
write
up
this
write
out
this
record
that'll
tell
you
I've,
scrubbed
or
resold
ur
up
to
this
TX
G
and
then
tap
TX
g
and
that
TX
g,
it's
basically
keeping
that
kind
of
progress
nd
it
does
a
little
bit
more.
Does
dxg
and
object,
set
number
and
object
number
and
offset
within
the
object
and
so
forth,
and
once
it
gets
around
back
to
the
next
syncing
context,
it'll
pick
that
up
again
look
at
where
it
was
and
keep
going
from
there.
A
So
it
does
it
it
sort
of
stops
and
starts
the
only
chip.
This
only
changes
it
changes
it
insofar
as
we
don't
write
out
the
DSL
scan
feste
every
transaction
group
number,
but
only
once
we
are
done
with
the
checkpoint
and
we
have
cleared
out
all
the
queues.
So
we
are
certain
that
we've
gotten
up
to
this
very
point:
that's
when
we
issue
the
right
in
syncing
context,
and
then
we
keep
going
with
the
scanner.
Yes, you're basically describing the common problem called block pointer rewriting. The issue is that checksums are not stored in the blocks themselves but in the pointers to them. So whenever you rewrite a block (I've got to finish up with this last question), that means you've changed a checksum, but there's a myriad of places that can still refer to it.
So for pretty much any rewrite of a block that's already committed, you would have to scan the entire file system, find all the places that refer to it, rewrite those, and then go back again. It's basically an exponential kind of problem, and it makes file system people's heads explode.