From YouTube: Persistent L2ARC by George Amanakis
Description
From the 2020 OpenZFS Developer Summit
slides: https://drive.google.com/file/d/1N4drzhggcbgVZ36y5HyNdDuOXTsye1N_/view?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020
A: Good, so I'm going to talk about how we made the L2ARC persistent. Before this project was merged, the contents of the L2ARC were not persistent across reboots. So when you exported the pool or rebooted the machine, you would lose the buffers on the L2ARC.
A: Regarding the implementation: what happens, as you can see in the schematic at the bottom of the screen, is that we write buffers, and when the device is full, it will start evicting previously written buffers in order to make space to write new ones. So the important thing to realize here is that the device behaves like a ring: once it is full, writes wrap around and reclaim the oldest buffers.
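A minimal sketch of that ring behavior may help; this is illustrative only, with hypothetical names, not the actual OpenZFS code:

```c
/*
 * Illustrative sketch of the L2ARC device as a ring: a write hand
 * appends buffers sequentially and, once the device is full, wraps
 * around and reuses the oldest region. All names are hypothetical.
 */
#include <stdint.h>

typedef struct l2dev {
	uint64_t start;	/* first usable byte on the device */
	uint64_t end;	/* one past the last usable byte */
	uint64_t hand;	/* next write offset */
} l2dev_t;

/* Reserve space for a buffer of the given size, wrapping if needed. */
static uint64_t
l2dev_reserve(l2dev_t *dev, uint64_t size)
{
	if (dev->hand + size > dev->end) {
		/* Device full: wrap the write hand back to the start. */
		dev->hand = dev->start;
	}
	/*
	 * In the real code, buffers previously written in the range
	 * being reused are evicted from the ARC index before the
	 * space is overwritten.
	 */
	uint64_t offset = dev->hand;
	dev->hand += size;
	return (offset);
}
```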
A: So how did we make the L2ARC persistent? First of all, enabling persistence means that we need to restore, in memory, the header entries of the buffers that reside on the L2ARC. To do this, we implemented an on-disk structure called L2ARC log blocks. These are actually metadata that contain the buffer header entries. You can see the structure down here: it has a magic value for determining endianness, it has a pointer to the previous log block, and then it has the actual header entries that are going to be restored into the ARC. A log block is written to the L2ARC every 1022 buffers.
A: So, okay, we have this on-disk structure, but how do we keep track of those log blocks? I mentioned earlier that each block contains a pointer to the previous one, and that's how we can keep track of them. That pointer holds properties of the block, like its offset on the device, the starting offset of the payload, the allocated size, the compression algorithm (it's worth noting that normally those log blocks are compressed using LZ4), its checksum algorithm, and the checksum of course.
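Putting that description together, here is a simplified C sketch of the log block and the pointer that tracks it. It is modeled only on what the talk describes; the field names and widths are illustrative, not the authoritative OpenZFS definitions:

```c
/*
 * Simplified, illustrative sketch of the persistent L2ARC on-disk
 * metadata described in the talk. Not the authoritative definitions.
 */
#include <stdint.h>

#define	L2_LOG_BLK_ENTRIES	1022	/* buffers covered per log block */

/* Pointer to a log block: enough to locate, read, and verify it. */
typedef struct l2_log_blkptr {
	uint64_t	daddr;		/* offset of the log block on the device */
	uint64_t	payload_start;	/* starting offset of its payload */
	uint64_t	asize;		/* allocated size of the log block */
	uint8_t		compress;	/* compression algorithm (normally LZ4) */
	uint8_t		checksum_alg;	/* checksum algorithm */
	uint8_t		cksum[32];	/* checksum of the log block */
} l2_log_blkptr_t;

/* One buffer-header entry to be restored into the ARC. */
typedef struct l2_log_ent {
	uint8_t		dva[16];	/* data virtual address */
	uint64_t	birth_txg;	/* birth transaction group */
	uint64_t	lsize;		/* logical size */
	uint64_t	psize;		/* physical (allocated) size */
	uint8_t		compress;	/* compression algorithm and level */
	uint8_t		type;		/* buffer content type */
} l2_log_ent_t;

/* The log block itself: metadata holding the header entries. */
typedef struct l2_log_blk {
	uint64_t	magic;		/* magic value, detects endianness */
	l2_log_blkptr_t	prev_lbp;	/* pointer to the previous log block */
	l2_log_ent_t	entries[L2_LOG_BLK_ENTRIES];
} l2_log_blk_t;
```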
A
So
we
for
performance
reasons.
We
don't
actually
have
a
single
chain
of
log
blocks,
but
we
have
two
interleaved
chains
and
you
can
see
the
the
schematic
at
the
top
of
the
slide.
The
concept
here
is
that,
while
we
are
issuing
a
synchronous
rate
to
read
one
lot
block,
we
also
issue
an
asynchronous
ring
to
read
immediately
a
prior
one.
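A hypothetical sketch of that rebuild loop follows. The function names stand in for the real ZIO machinery and none of them are actual OpenZFS identifiers; the point is how the two interleaved chains let the read of the next log block overlap with restoring the current one:

```c
#include <stdbool.h>
#include <stddef.h>

/* 'valid' stands in for the real offset/size/checksum sanity checks. */
typedef struct lbp { bool valid; /* offset, size, checksum, ... */ } lbp_t;
typedef struct logblk { lbp_t prev; /* decoded header entries ... */ } logblk_t;

extern void	read_async(const lbp_t *lbp);	/* start I/O, return at once */
extern logblk_t	*read_sync(const lbp_t *lbp);	/* fast if already prefetched */
extern void	restore_entries(const logblk_t *lb); /* rebuild ARC headers */

/*
 * The rebuild walks backwards from the two most recently written log
 * blocks (found in the device header). Each block's prev pointer
 * refers two blocks back, which is what makes the chains interleave.
 */
static void
l2_rebuild(lbp_t lbps[2])
{
	while (lbps[0].valid) {
		/* Prefetch the next block, on the other chain, right away. */
		if (lbps[1].valid)
			read_async(&lbps[1]);

		/* Read (or collect the prefetched) current log block. */
		logblk_t *lb = read_sync(&lbps[0]);
		if (lb == NULL)
			break;

		/* Restoring overlaps with the prefetch issued above. */
		restore_entries(lb);

		/* Advance both chains. */
		lbps[0] = lbps[1];
		lbps[1] = lb->prev;
	}
}
```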
A: So while we are decompressing and restoring the current log block, we spend that time reading the previous one. In terms of performance, on a consumer-grade SATA SSD, with a processor that is kind of old by today's standards, restoring the contents of a 64-gigabyte L2ARC device, which corresponds to about 100 gigabytes of buffers in terms of logical size, takes about three seconds if we had just one chain of log blocks, and about two seconds with the current design. So we have a performance gain of about thirty percent.
It is also worth noting here that the L2ARC rebuild is done asynchronously with respect to pool import, so we don't actually wait for the rebuild to finish in order to finish importing the pool. And it is also worth noting that we don't write buffers to the cache device until the rebuild has been completed.
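A minimal sketch of those two orderings, again with hypothetical names rather than the actual OpenZFS flags:

```c
#include <stdbool.h>

struct l2dev_state {
	bool rebuilding;	/* set while the async rebuild runs */
};

/* Pool import: start the rebuild in the background and return
 * immediately; import does not wait for it to finish. */
void
import_kick_off_rebuild(struct l2dev_state *st)
{
	st->rebuilding = true;
	/* spawn_thread(rebuild_thread, st);  clears 'rebuilding' at end */
}

/* The write path skips a cache device whose rebuild is still running,
 * so new buffers are only written once the rebuild has completed. */
bool
can_write_buffers(const struct l2dev_state *st)
{
	return (!st->rebuilding);
}
```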
A: So now that we know this information, the question was: how do we actually start the L2ARC rebuild? To do this, we implemented another on-device structure, the device header. This is updated each time a log block is written to the cache device, and it contains pointers to the two most recently written log blocks.
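A simplified sketch of the device header, in the same illustrative style as the log block sketch above; the fields here are not the authoritative definitions:

```c
#include <stdint.h>

/* Repeated from the earlier sketch so this compiles standalone. */
typedef struct l2_log_blkptr {
	uint64_t daddr, payload_start, asize;
	uint8_t  compress, checksum_alg, cksum[32];
} l2_log_blkptr_t;

/*
 * Illustrative device header: lives at a fixed location on the cache
 * device, is rewritten whenever a log block is written, and anchors
 * the rebuild.
 */
typedef struct l2_dev_hdr {
	uint64_t	magic;		/* identifies a persistent L2ARC device */
	uint64_t	flags;
	uint64_t	evict_hand;	/* offset up to which buffers are evicted */
	l2_log_blkptr_t	start_lbps[2];	/* two most recently written log blocks */
	uint64_t	log_blk_count;	/* number of log blocks on the device */
	uint64_t	log_blk_asize;	/* their total allocated size */
} l2_dev_hdr_t;
```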
A: Here you can see that the most recently written log block is this one, and you can also see that ZFS has evicted ahead some space in order to accommodate upcoming writes; this is shown by the eviction hand. So the eviction hand is actually the offset up to which ZFS has evicted buffers in order to make space for the new ones. When we start rebuilding the L2ARC, we go in the opposite direction from the one in which we write.
A: [With the rebuild tunable turned off,] this means that log blocks, so the metadata, are still actually written to the device, but once the pool is imported, the L2ARC buffers won't be restored. The other tunable disables completely the writing of log blocks on the cache device, and this may be beneficial for small devices: it defaults to one gigabyte, so if the device is smaller than one gigabyte, L2ARC persistence is disabled by default.
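A hypothetical sketch of those two gates as described; the variable names are illustrative stand-ins for the actual module tunables:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the two tunables described above. */
static bool     rebuild_enabled = true;		/* restore buffers on import? */
static uint64_t min_persist_size = 1ULL << 30;	/* 1 GiB default */

/* Log blocks are only written to devices at least this large; smaller
 * devices effectively have persistence disabled by default. */
static bool
should_write_log_blocks(uint64_t dev_size)
{
	return (dev_size >= min_persist_size);
}

/* With the rebuild disabled, log blocks may still have been written,
 * but the buffers are not restored when the pool is imported. */
static bool
should_rebuild_on_import(uint64_t dev_size)
{
	return (rebuild_enabled && should_write_log_blocks(dev_size));
}
```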
A: We have also taught zdb to be able to read those on-device structures, so you can see the device header here in an example. You can see, highlighted in yellow, the offsets of the two most recently written log blocks. You can also see information like their count, 28 of them here, and their allocated size. You can also inspect the contents of the log blocks. For example, you can see here lb1; this is the first one, the most recently written one. You can see its compression algorithm and its checksum algorithm. You can also inspect the buffer header entries for each of its buffers, so things like the DVA, the allocated size, the birth transaction group, the compression level, the buffer content type, and so on. In terms of arcstats, we implemented two sets of them. One set is updated online, as writes are happening to the cache device, and there you can see information like the number of log blocks and their allocated size; the other set [is updated during the rebuild].
A: So, unfortunately, I don't have one of those. But yes, in personal testing, up to 128 gigabytes. Jan asks: how does persistent L2ARC impact import times? Does the pool import block until the L2ARC has been repopulated? No, it doesn't block it. The L2ARC rebuild happens asynchronously in the background, and this does not impact the importing of the pool. Another question: are there any operations that the rebuild blocks until it finishes? No, it doesn't; it shouldn't be blocking anything. This has also been manually tested, by putting delays in the code to make sure it works as intended, and we haven't seen anything.
A: And Becky asks: is mirroring the L2ARC not necessary now? So this is an interesting concept. I don't think that ZFS allows the L2ARC to be mirrored.

C: Yeah, I don't think so either; you might be thinking about the log. Yeah, I should just post issues.
C: The log device is kind of a different story; it's also an auxiliary device, and it's also a way to get better performance by using faster hardware. But because the L2ARC isn't required for the integrity of the pool, we didn't bother with mirroring for that. For the log, though, you would still want to have mirroring.
D: Yeah, I was going to add: originally, because we didn't have persistent L2ARC, it didn't make a lot of sense to actually build in any redundancy. Now, with persistent L2ARC, that may change for certain people, and a feature you may want is the ability to have some redundancy for your L2ARC, just because you want to be able to preserve it in the event that something fails and you want to, like, reboot or something.
E: Yeah, when we implemented the L2ARC at Oracle back when, we decided not to add redundancy simply because we viewed the actual data storage as the redundant component of the L2ARC. So you could always find the data back on disk if you needed to get it.
A: So, the status of this project: it has already been merged, it is in the master branch, and it will be in the upcoming OpenZFS 2.0 release.
A: Saso [Kiselkov] is the one who did the original work; it was later ported to ZFS on Linux by Yuxuan [Shui], and I actually picked up the code that Jorgen [Lundman] had lying around in the GitHub pull request. So I'm very grateful to them, and also to everybody who provided feedback and reviewed code; I've listed some of the people here. I think it took me about five months to complete this work and get it merged, and the support of the OpenZFS community has been great.
C: Cool. It looked like there were one or two more questions. Oh yeah.
A: So Christian is asking: is the read code using the ZIO pipeline? Yes, I'm pretty sure it does; the writes that happen go through the ZIO pipeline. And when rebuilding the L2ARC you keep all the performance gains, because we're not actually re-reading the data on the L2ARC; we just go and read and restore the metadata, the buffer header entries of the buffers that lie there.
A: So we don't actually spend time checksumming; I mean, yes, we spend time checksumming to see if the checksums of the log blocks match and whether they're valid, but at that point it's only the log blocks we care about, not the actual buffers. It's when ZFS wants to read a block that it will do the checksumming and decide whether the copy that is in the L2ARC is valid or not. So this doesn't happen for the buffer itself; the checksum control doesn't happen at the time of the rebuild. At the time of the rebuild, all we care about is whether the log blocks are valid, so as to go ahead and restore their entries. So no, I don't think that this has an impact on the performance gains.
And James is asking: if writes to the pool happen before the L2ARC is populated after an import, how is this kept in sync, for example if the L2ARC contents don't match anymore? Yes, okay, this is a great question. So, in terms of this: if, say, the contents of the disk change but there is still information on the L2ARC that is not up to date, ZFS will become aware of it when those blocks are actually read. It's then that ZFS will see that, okay, we have information in the L2ARC whose checksum doesn't match what's on disk, so it will go on and fetch from disk, not from the L2ARC.
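In other words, staleness is caught on the read path. A hypothetical sketch of that fallback, with illustrative names only:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct blkptr blkptr_t;	/* block pointer, carries the checksum */

extern void *l2_read(const blkptr_t *bp);	/* NULL if not cached */
extern bool checksum_ok(const blkptr_t *bp, const void *data);
extern void *pool_read(const blkptr_t *bp);	/* authoritative copy on disk */

/* Read a block, preferring the L2ARC but falling back to the pool
 * whenever the cached copy is missing or fails its checksum. */
void *
read_block(const blkptr_t *bp)
{
	void *data = l2_read(bp);

	if (data == NULL || !checksum_ok(bp, data))
		data = pool_read(bp);	/* stale or damaged L2ARC copy */

	return (data);
}
```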
A: It's when that area of the ARC is scanned again that the L2ARC would get updated. Again, if the blocks on disk are updated, it doesn't mean that the corresponding L2ARC copy will be automatically updated. If there is a discrepancy, ZFS will opt to read from disk and not from the L2ARC copy.
B: Thank you, George. If anybody else has more questions, we're going to be having the breakout session following right now. So thank you, George, for your presentation. Great.
C: Thank you. Awesome, thanks, George, and thanks to whoever asked that last question. I think that's a definitely interesting question, especially because I think we try to make it so that, in the normal case where there are no hardware problems, we would like ZFS to not be relying on the checksums in order to get correct behavior; the checksum is just supposed to be there to check that the hardware is doing the right thing. So that would be an interesting area to do some more investigation and enhance the L2ARC, so that we could know for sure that the data we're reading should be the right data, assuming that the hardware is okay.