From YouTube: dRAID, Finally! by Mark Maybee
Description
From the 2020 OpenZFS Developer Summit
Slides: https://docs.google.com/presentation/d/1uo0nBfY84HIhEqGWEx-Tbm8fPbJKtIP3ICo4toOPcJo/edit?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020
A: dRAID, finally! Well, almost finally. I'm going to talk about dRAID and where we are with it, give you an idea of when you're actually going to see it, and why it isn't here already: we were hoping for 2.0, but we didn't quite make it.
So, the history again: this is a feature that was originally developed at Intel by Isaac Huang. He gave talks on it at the OpenZFS Developer Summit in 2015, 2016, and 2017.
A: The feature was picked up by Cray in 2018, because they wanted to use it as a key capability in a ZFS version of their distributed storage product. Brian Behlendorf adopted the Cray version of the feature in early 2020 and created what is now the current PR for the feature, and I have been working on it for a number of months, trying to get all the pieces pulled together.
A: All right. I want to introduce a little bit of terminology before I start, mostly because there's a lot of confusing terminology here that can easily get conflated between RAID-Z and dRAID. First, group size: the group size is the number of columns that data is partitioned into, plus parity.
A: Obviously, the number of data columns defines how much overhead your redundancy is going to cost, and the amount of parity determines how much redundancy you have. The dRAID size, then, is the number of drives that are actually used for storing data within a dRAID configuration.
A: A dRAID row is a 16-megabyte chunk of space allocated at the same offset across all the drives in the dRAID configuration; for example, row zero is offset zero through offset 16 MB across all drives in the config. You'll see why that's important in a little bit. And a permutation slice is one or more rows which are permuted based off of the dRAID permutation array; the actual number of rows per slice is derived from the least common multiple of the group size and the dRAID size.
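(As an aside: a minimal sketch of that rows-per-slice rule, in Python rather than the actual OpenZFS code. The example values are taken from the eleven-drive layout discussed next, assuming one distributed spare leaves a dRAID size of 10.)

```python
from math import gcd

def rows_per_slice(group_size: int, draid_size: int) -> int:
    # A permutation slice must contain a whole number of groups, so it
    # spans lcm(group_size, draid_size) columns; dividing by the number
    # of drives (the dRAID size) gives the number of 16 MB rows.
    lcm = group_size * draid_size // gcd(group_size, draid_size)
    return lcm // draid_size

print(rows_per_slice(5, 10))   # -> 1: group size 5 divides 10 evenly
print(rows_per_slice(4, 30))   # -> 2: groups of 4 need two rows to tile 30 drives
```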
A: So, keeping all that in mind: what is dRAID? dRAID is RAID (in our case, in ZFS's case, RAID-Z) declustered.
A
So
what
that
means
is
on
the
left.
You
see
here
an
example
of
what
a
raid
z
layout
might
look
like
where
you
have
two
raids
top
levels
with
hot
spare
allocated
for
your
pool
and
the
equivalent
or
the
contrast
of
that
to
as
a
d
raid
would
be
to
say.
I
want
a
d
array
defined
with
five
a
group
size
of
five
across
these
eleven
drives,
with
one
of
the
the
drives
being
or
drives
with
a
spare
capacity.
A: On both sides, you see I've divided the tables into rows, and in this case these rows represent a group, a permutation slice, and a dRAID row all at once. Each row is permuted from a traditional layout across drives (RAID-Z, on the left) to a semi-random layout (on the right), based off of a computed permutation, which lets us spread our data relatively randomly across the entire pool.
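(A minimal sketch of the idea, assuming a simple seeded shuffle, which is not the exact permutation scheme dRAID uses: each slice applies its own permutation of the drive indices, so the same logical column lands on different physical drives from slice to slice.)

```python
import random

def drive_for(slice_idx: int, column: int, ndrives: int, seed: int = 42) -> int:
    # Derive a per-slice permutation of the physical drives. Illustrative
    # only: real dRAID uses precomputed, balance-checked permutations.
    perm = list(range(ndrives))
    random.Random(seed * 1_000_003 + slice_idx).shuffle(perm)
    return perm[column]

# The same logical columns map to different physical drives in each slice:
for s in range(3):
    print([drive_for(s, c, ndrives=11) for c in range(11)])
```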
A: Okay, let's talk in a little more detail about RAID-Z versus dRAID.
A: dRAID uses a different map allocation function, but its pipeline and its functions are almost identical to the RAID-Z code, and it actually calls into the RAID-Z layers to do its I/O. In a RAID-Z layout, each group is constructed from a set of physical drives, and its columns extend the entire length of those drives.
A: So the only constraint you have is that allocations into those RAID-Z groups must be a multiple of parity plus one. This is to prevent stranded space, space so small that you can't actually allocate from it; anything smaller than parity plus one is not allocatable in a RAID-Z layout.
A: In dRAID, groups are divided into rows of 16 MB. Rather than a column consuming the entire disk as in RAID-Z, each column is chunked into 16 MB pieces, and contiguous 16 MB chunks don't necessarily reside on the same drive (this is the permutation), so group rows are non-contiguous on physical drives. And here allocations must be a multiple of parity plus data, not just parity plus one.
A: That means that, for the same example, 1K of data is going to require 2.5K of space in a group of size five; and if the group were larger, say an eight-disk group, you would have to add another three 512-byte chunks to that allocation in order to align it to the dRAID constraints. I'll explain exactly why that is in a minute.
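(The rounding rule, as a hedged sketch; group_size here includes parity, matching the terminology above.)

```python
def draid_alloc_bytes(data_bytes: int, group_size: int, parity: int,
                      sector: int = 512) -> int:
    # dRAID rounds every allocation up to whole group rows: a multiple
    # of (data + parity) sectors, not just (parity + 1) as in RAID-Z.
    data_sectors = -(-data_bytes // sector)            # ceiling division
    needed = data_sectors + parity
    rows = -(-needed // group_size)
    return rows * group_size * sector

print(draid_alloc_bytes(1024, group_size=5, parity=1))  # -> 2560: 1K needs 2.5K
print(draid_alloc_bytes(1024, group_size=8, parity=1))  # -> 4096: 3 more 512-byte chunks
```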
A: So why decluster? There are a couple of very good reasons. One: spare drives are leveraged. In the first example I showed, we had our hot spare, and that hot spare is a spindle that sits idle most of the time, until there's actually a problem and it needs to be utilized; it's never used except to be ready to go.
A: In a declustered configuration, we distribute the spare capacity throughout the dRAID, so that actual spindle is leveraged with a combination of data and spare capacity, and every spindle is the same. That way, each group's data is randomly distributed across all the drives. This decouples the redundancy group from the number of data drives, so that as we are allocating, reading, and writing, we can consume from all the spindles, rather than just the set of spindles which make up that particular group.
A: This allows us to use all the spindles all the time. And, most importantly, sequential resilver works in this configuration, because of that constraint that all rows must be full rows; i.e., we always have minimal allocations of parity plus data, so we always fill the entire row. This way we can construct our pseudo-blocks for the dRAID, because we know exactly where the parity is going to be laid out and where the data is.
A: So how do you go about creating a dRAID in ZFS? We added a new top-level vdev type called draid, very similar to a raidz vdev. We support draid1, draid2, and draid3 for the various parity levels.
A: You can specify a redundancy group size at creation: the colon-5d example here, draid1:5d, would be a group size of six, five data plus one parity. The default for the data portion of the group size is eight.
A: In general, though, the group size has to be no larger than the number of data drives in your config. This may not seem obvious, but if you had, for example, three drives and tried to define a group size of four on that, it doesn't work, because you don't have enough drives to actually provide the level of redundancy you're asking for: with only three drives, you can't get four drives' worth of redundancy. But you can define any group size smaller than the dRAID size.
A: You can specify the spare capacity as a drive count. So in this example, draid1:5d:2s says: I want two drives' worth of spare space.
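(A toy parser for that vdev notation, to make the fields explicit. This is a sketch of the syntax as described in the talk, not the OpenZFS implementation; the released grammar also takes a child count, as in draid1:5d:11c:2s.)

```python
import re

def parse_draid_spec(spec: str) -> dict:
    # draid[parity][:<data>d][:<children>c][:<spares>s]
    m = re.fullmatch(r"draid([123])?(?::(\d+)d)?(?::(\d+)c)?(?::(\d+)s)?", spec)
    if not m:
        raise ValueError(f"bad dRAID spec: {spec}")
    parity = int(m.group(1) or 1)
    data = int(m.group(2) or 8)          # default data width is eight
    return {"parity": parity, "data": data, "group_size": data + parity,
            "children": int(m.group(3)) if m.group(3) else None,
            "spares": int(m.group(4) or 0)}

print(parse_draid_spec("draid1:5d:2s"))
# -> {'parity': 1, 'data': 5, 'group_size': 6, 'children': None, 'spares': 2}
```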
A: That means that, out of the total drives specified for the dRAID config, we will reserve two drives' worth of capacity as spare space. And we now expose in the status as much of the actual configuration as we can. Because the groups are now pseudo-groups within the config, there's no easy way to say these four drives belong to this group and those four drives belong to that group, as you would see in a RAID-Z; instead, it's random. So we present it as a top-level draid vdev over all the drives, but in the draid name we tell you how it's been partitioned logically. And at the end we have our spares; these are pseudo-spares, again, because we haven't reserved two physical drives, we have reserved two drives' worth of space, and these two entries represent the pseudo-handles used to access that drive space, which is distributed across all the drives.
A: All right. I want to move on now to talk about the various issues that we ran into, that Brian and I had to address as we were trying to get this code ready for integration.
A: In general, you have to have a permutation array that defines the permutation for each slice of the dRAID configuration. That had been generated at pool creation and stored in the label, and it was critical information that had to be present when you first load the pool, because all access to the pool has to go through figuring out the permutation. It was put in every label because it was critical data: if you lost it, you could not access any data in the pool anymore, because you wouldn't know how to reconstruct the chunks of groups and slices; if you lost a permutation, you wouldn't know how to fit all that data back together again. So there were sort of two issues there.
A: One was the fact that we were using up a lot of label space for this, and two was the fact that losing it would be really, really bad. So the answer we came up with was to essentially predefine the permutations for a configuration. Brian figured out a means of coming up with a predefined set of permutations, or at least of seeds for calculating permutations, that would always produce the same permutation, and a relatively optimal one. That is how it's now handled inside the dRAID code: there's a table of these permutation seeds that is used to generate the permutations on the fly as soon as you bring the pool up, so there's no risk of losing the permutation.
A: It's part of the implementation, and it's basically instantaneous, because it's all there to start with; you're not going through a computation phase to figure out what your permutation array is going to look like. And there's no need to store it in the labels anymore, so we've freed back up the space we had been consuming there.
A: All right, the second issue we addressed was group size constraints. In the original implementation, there was the notion that the dRAID slice was always equivalent to the dRAID row: you only ever had one row in your slice. Basically, that meant your groups had to divide evenly into the number of drives in the dRAID, your dRAID size.
A: If you wanted even group sizes, that meant that, for example, in a 30-drive configuration you could only support three-drive groups, five-drive groups, ten-drive groups, or fifteen-drive groups; it had to be an even multiple. You couldn't, for example, support a group width that didn't divide evenly into those 30 drives.
A: We did work out a variation, an enhancement to dRAID, which allowed you to define a configuration that used different-sized groups to fill out the set of groups that went into the row.
A: This gave us more flexibility and let us find group sizes that were closer to what you were looking for as a user of this capability. The problem was that it was not optimal, obviously: you'd end up with some groups one size and some groups another size, and it's hard to reason about a pool, in terms of performance at least, when you have different-sized groups like that.
A: So, instead of requiring that each slice have just a single row in it, we say the number of rows in the slice is actually going to be derived from the least common multiple of the group size and the dRAID size. This allows us to tile in as many groups as needed to get an even multiple, so we evenly fill up the space in that permutation slice. That basically decouples the group size from worrying about the number of groups.
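(To make the tiling concrete, a sketch in the same spirit as the earlier ones: lay groups end to end across the slice; the least common multiple guarantees the last group ends exactly on a row boundary, even when individual groups wrap across rows.)

```python
from math import gcd

def tile_groups(group_size: int, draid_size: int) -> None:
    # Groups are placed end to end; lcm columns make a whole number of
    # groups AND a whole number of draid_size-wide rows.
    lcm = group_size * draid_size // gcd(group_size, draid_size)
    for g in range(lcm // group_size):
        start, end = g * group_size, (g + 1) * group_size - 1
        print(f"group {g:2}: row {start // draid_size} col {start % draid_size}"
              f" .. row {end // draid_size} col {end % draid_size}")

tile_groups(group_size=4, draid_size=30)  # 15 groups tile 2 rows; some wrap rows
```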
A: The next issue we had to address was stranded space.
A: For each group row, you have a chunk of basically 16 megabytes, but the next row is not necessarily on the same drives, so as you're allocating into it, you actually have a different set of drives representing the next row in that group.
A: This did not work well with the logic in RAID-Z, because that's never an issue there: in RAID-Z, you always have the entire drive to deal with in the columns, so you never have to switch drives in the middle of a block. What that meant for dRAID, because we had this constraint, was that you could fill up a group's 16 MB chunk and get to a point where: all right, I'm trying to allocate this block and there's no space, so I'm going to move over to another group that has more space. You can end up with bits of space at the end of groups which are not easily allocated; you end up stranding that space. Now, in theory, you could eventually use that space with smaller block allocations.
A: But it was an awkward model: particularly in environments where you're doing a lot of large-block allocations, which is typical for a dRAID configuration, you could end up with a lot of these small chunks of space which you're never really going to make use of, and given the large number of rows in a big configuration with many large drives, that could be a problem. So the answer we came up with here was to leverage the multi-row allocation maps that were developed for the RAID-Z expansion project (thank you very much, Matt). Those allow us to define a block allocation that can span two separate groups: the first row says these columns reside on this set of drives, and the second row says the rest of the columns reside on this other set of drives.
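(Conceptually, borrowing the idea rather than the actual data structure, such a map records which physical drives hold the block's columns in each row; note how the drive set changes when the allocation crosses a row boundary.)

```python
def span_columns(start_col: int, nsectors: int, perms, draid_size: int) -> dict:
    # perms[row] is that row's (made-up) permutation of physical drives.
    # An allocation running past the end of one row continues on the next
    # row's differently-permuted drives: two rows, two drive sets.
    rows: dict = {}
    for i in range(nsectors):
        row, col = divmod(start_col + i, draid_size)
        rows.setdefault(row, []).append(perms[row][col])
    return rows

perms = [[3, 0, 4, 1, 2], [1, 4, 2, 0, 3]]       # hypothetical 5-drive layout
print(span_columns(start_col=3, nsectors=4, perms=perms, draid_size=5))
# -> {0: [1, 2], 1: [1, 4]}: the block spans two rows on different drives
```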
A: Next, and finally, we had to deal with the issue of space inflation.
A: This is a problem similar to the RAID-Z one, where we have to allocate additional sectors to fill out the constraint on the allocation size (parity plus one, in the case of RAID-Z), but it gets worse with dRAID, because we're allocating entire group rows, so we are now potentially filling a number of sectors that are not part of the data itself. This can be particularly significant if you're writing a lot of small-block data.
A: Let's go back to our example: say we have an eight-wide stripe with two parity, at a 512-byte sector size. Six of those eight sectors are data, so 3K of space is our minimum allocation size: if you start saving blocks that are smaller than 3K, you are going to consume 3K of data space for those blocks regardless. And the minimum allocation size grows as the stripe width gets bigger.
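(That arithmetic as a sketch, using the eight-wide, double-parity example: six data sectors per row, so every block is charged at least 3K, and a million 1K files consume roughly three times their logical size. Simplified accounting; parity and other overheads sit on top.)

```python
def charged_bytes(size: int, data_width: int, sector: int = 512) -> int:
    # Data sectors rounded up to whole rows of data_width sectors each.
    data_sectors = -(-size // sector)
    rows = -(-data_sectors // data_width)
    return rows * data_width * sector

print(charged_bytes(1024, data_width=6))          # -> 3072: a 1K block uses 3K
print(charged_bytes(1024, data_width=6) * 10**6)  # ~3 GB for a million 1K files
```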
A: What's also important to realize here is that in dRAID we explicitly zero-fill these extra sectors: when we write them, we actually write out zero-filled data to fill those sectors. That's important from the sequential-resilver perspective, because we need to be able to evaluate parity from arbitrary rows where we don't know where the data is and where the fill blocks are; having that fill be zero-filled data is critical to us.
A: So the answer here, and it's not a perfect answer: allocation classes go a long way towards helping. You can simply define some allocation classes, some additional capacity in your pool, and say: small-block data and metadata should be written out to these drives, which are configured more optimally for small-block data. The other answer is to use large blocks; dRAID is definitely tailored, in certain ways, for large-block data.
A: The larger the block, the less overhead you're going to see from this kind of fill inflation; percentage-wise, per block, it's very small. We at Cray tend to write 1 MB, 2 MB, up to 16 MB blocks, so we use large blocks all the time, and even when our blocks don't divide evenly into our group width, the overheads are minimal for us. That helps, but you always have to be aware of these overheads: you can see situations where your space is being used up faster than you realize. We're not stranding any space here, but we are inflating your allocation sizes, and the way that manifests is: my pool seems to fill up much faster than I anticipated, because I wrote out a whole bunch of small files.
A: So what about drive replacement performance? Again, one of the main focuses of dRAID was figuring out how to improve our device replacement and device rebuild process, and a dRAID configuration uses sequential resilver.
A: The spare capacity, again, uses all the drives, so we're reading from all drives to pull in the data necessary, reconstructing using data from all drives simultaneously, and writing across all the drives to fill in the hot-spare data. For small groups in a large configuration (in this example here, we have 90 drives), if you have a lot of small groups, you're going to see some amazing performance: as you see here, we're getting a 10x faster replacement. And even with a relatively large group size, 15-wide stripes, you're still seeing significantly faster rebuild rates. That's just a win, and it's the most important feature, I think, of dRAID.
A: Even a dRAID configuration with just a single group in it still gives us a benefit, because, one, we've gained the spare as an extra spindle in the config, and two, even though the reads are not any faster (we read across all the drives either way), the writes happen faster, because we're actually writing to all the drives for the spare space, and writes tend to be the limiter in device reconstruction.
A: All right, so when is dRAID going to be available? As I mentioned, we originally targeted the ZFS 2.0 release, and Brian and I worked really hard to make it.
A: At least, I twisted his arm hard when it got down to late August and he wanted to cut a release, but it just wasn't quite there: we still hadn't finished all the issues I just covered, and there was a bit of work to harden the code and fix up various issues that showed up as we were doing our work. So we ended up getting pushed out of the 2.0 release, and we're now targeting the 2.1 release.
A: I think we're doing well for that. The current status of the pull request is that it's converging. I think Brian said he has almost everything he wants in there, in terms of addressing all the issues he's seen with ZTS and the zloop issues he's encountered while doing a lot of testing.
A: So I think it's pretty hardened from that perspective. We have been doing a bunch of testing at HPE Cray, hammering it with our workloads, and it looks very solid at this point. I think we're largely waiting on the final code reviews for the release, so, fingers crossed, in the next few weeks, or at least by end of year, we'll have it in, and the 2.1 release out in a late-this-year or early-next-year timeframe. Brian can correct me if he's decided otherwise.
A: As for next steps: obviously, we need to get this thing integrated. As I said, there are a couple of small issues being closed out and the final code reviews getting done, but it's almost there. The next thing we're looking at beyond that is dRAID expansion.
A: The idea is: could you add a drive to a dRAID configuration? I think the answer may be yes, and I think this could be an interesting thing to look at. So if anybody's interested in thinking about that, maybe working on some prototype code, let's get together at the hackathon and see what we can do. All right, with that...
B: All right. So, yeah, we have about six questions now. There are a few from Jan; first one: what makes a permutation bad or optimal for dRAID?
A: The whole idea of the permutations is to derive a way of laying out the data such that you randomly distribute your data across all your drives, so that as you do your I/O, you're hitting as many drives as possible, all the time. And it has to be done in such a way, of course, that you don't have any overlaps that would destroy your redundancy: you have to be able to preserve the data given your parity. So if you have single parity, you should be able to lose a drive and not have a problem; if you have dual parity, you need to be able to lose two drives and not have a problem. The permutation algorithm is essentially going through making sure the permutations do not destroy the redundancy and, at the same time, give you as random a layout as possible for your data.
B: All right. Jan has two more questions, but I'm going to jump to someone else for now, so that everybody gets a chance. In practice, how has dRAID affected compression? That is, how much of the compression gains are lost to padding in your large-block workload?
A: In general, I haven't seen a huge impact from compression there. I mean, if you're compressing your large data down to nothing, or down to a very small block, of course that's going to change the dynamics of the situation.
A: Your logical throughput is going to remain very, very good. Your physical throughput may drop, because you're doing a lot of smaller I/Os, but that's not really an issue. What you can tend to see when you start layering compression on top of a dRAID is that the compression is obviously going to consume processor time to do that work, and that may end up slowing down your throughput, because you're now spending a lot of time doing the compression work itself.
A: But I don't think it materially impacts your layout, beyond the fact that, as you're laying down smaller blocks, you may end up with some more padding than you would have had if you had laid down larger blocks. But since you're compressing, you're saving space, so it doesn't seem like a problem to me.
C: Can I give an example? If you're using the default of eight wide, an eight-wide group, on 4K disks, then that means your allocation unit is 32K.
C: So if you're taking a 128K block and compressing it down, then you're probably adding, on average, 16K extra, which is, what, 15 percent or so more space. So maybe, instead of 2:1, your ratio is more like 1.7:1 or something.
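(Spelling out that back-of-envelope arithmetic; the 16K figure assumes compressed sizes land uniformly within the 32K allocation unit, so the average padding is half a unit.)

```python
def effective_ratio(logical: int, ratio: float, alloc_unit: int) -> float:
    # Compress, then pad by half an allocation unit on average.
    compressed = logical / ratio
    return logical / (compressed + alloc_unit / 2)

# 128K records, 2:1 compression, 8 data drives x 4K sectors = 32K unit:
print(round(effective_ratio(128 * 1024, 2.0, 32 * 1024), 2))  # -> 1.6, in the 1.7:1 ballpark
```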
C: And then maybe you want to go to the question about small block sizes, because that ties into this as well. There's a question from Jan: is there a script to calculate optimal dRAID layouts for small block sizes, like 8K or 16K?
A: So, there's no script for calculating optimal dRAID layouts. I think, as a sysadmin, you could say: all right, my workloads are going to be comprised of a lot of 8K I/O. Matt actually raised this question to me earlier: what's the recommendation, for example, if you're creating zvols, and should you maybe change your default block size in that case? And as you just said, if you're using an eight-wide stripe and your average block size is 8K while the minimum allocation is 32K, you're not going to be happy. So there's definitely a situation where you have to balance and say: given the average block size, a very large stripe width is not going to be a win for me. You need to account for your minimum allocation size in your calculations.
C: Yeah, so you don't really need a script, because it's so simple; it's a lot simpler than RAID-Z, actually. If you're using a 16K record size, not using compression, on 4K-sector disks, then you have four sectors per block, so you want to use a dRAID whose group data width is four or a factor of four; basically, four or two will lead to no additional padding there.
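(That rule of thumb as a tiny checker, a sketch where zero padding simply means the block's data sectors divide evenly into the group's data width.)

```python
def padding_sectors(recordsize: int, data_width: int, sector: int = 4096) -> int:
    # Sectors of padding added when a block is rounded up to whole rows.
    data_sectors = -(-recordsize // sector)
    return (-data_sectors) % data_width

print(padding_sectors(16 * 1024, 4))  # -> 0: 4 sectors fit a 4-wide group exactly
print(padding_sectors(16 * 1024, 2))  # -> 0: a factor of four also works
print(padding_sectors(8 * 1024, 8))   # -> 6: 8K records on an 8-wide group waste 75%
```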
C: But, you know, you might want to use compression.
A: Yes, because the database files are compressed.
C: So, what are you thinking? Yeah, since there are so many questions, I would say: why don't we keep it going here. If folks need to drop off for lunch, go ahead, or if folks want to go to the breakout rooms, you can go ahead, but it seems like there's a lot of interest in this, so why don't we keep going here for those who want to continue.
B: All right, so I will continue the questions in chronological order. From Jan: could RAID-Z expansion work on dRAID?
A: So, that's exactly it: RAID-Z expansion. Matt, I assume people are familiar with your RAID-Z expansion work; we are essentially leveraging the same ideas from that for the dRAID expansion concept. It's a little more involved and more complicated, and a little more space needs to be preserved, we believe. I mean, we've thrown around some ideas, and I have a sort of preliminary design in my head around it, but...
C: Yeah, come to the hackathon tomorrow, because I think Mark is hopefully going to be digging into this some more. And I think it's conceptually simpler in a lot of ways than RAID-Z expansion, because dRAID already has the concept of: this is the logical width of my thing, the data-to-parity layout, and then you have more drives than that, right? Your group width is eight, but you have, you know, 37 drives, and so when you're moving stuff around, you're taking those eight and kind of...
B: Okay, next question, another one from Jan: how much does dRAID improve read latency during resilver or rebuild; read latency peaks, more precisely?
A: So, the read latency of other processes is what he's interested in there, I think. Compared to RAID-Z, from the work that we've done measuring it, I don't know. Obviously, you can tune your resilver rates to try to compensate, for whatever you want in terms of other workload latencies, but we have not done enough work there to be able to tell you definitively: here are the right settings for this much impact, or that little impact, on your workloads during the resilver process. In general, your reads are competing with the resilver reads across all the spindles, but you are also using all the spindles all the time. So it's going to be an even match, and it's just a matter of saying: how do you want to balance your resilver reads versus any other reads that might be ongoing?
B: Yeah, makes sense to me. A question from Ryan: since this is built on the existing RAID-Z code, does that mean it already has vectorized math operations?
A: So, you know, you obviously still have physical drive issues here. When you have an issue with a drive failure, it's a physical drive failure: you don't logically lose drives, you physically lose drives. And the difference is that in RAID-Z, when you physically lose a drive, you see: I lost this drive, which is this column out of this particular RAID-Z group.
A: In dRAID, I've lost this physical drive, which is impacting all of my groups pretty much evenly, because you have a random data distribution. So when I replace it with a new physical drive, I'm going to pull from all the groups to rebuild onto that drive; or rather, if you have a distributed hot spare, you're going to rebuild onto the distributed hot spare first, and then, when you actually physically replace the drive, you resilver from that distributed hot spare onto the replacement drive. So you're again leveraging all of your distributed data.
B: Right. A question from, sorry for your name, Jim Leon: is the ClusterStor product at Cray going to be on ZFS, the Cray distributed storage you mentioned?
A: Yes. The base product at Cray, now HPE, is the ClusterStor storage product, and the next-generation version will have the option of using ZFS as its underlying storage.
A: Reduction, by the way, is, I think, conceivable but complicated. With expansion, you have the advantage that you're growing your space, so as you're rewriting, you always have space available to write the new version of the data, once you've copied aside some small amount of it. With reduction...
C: Yeah, I think you'd essentially have to have no allocations past the point that you're removing, so that you could just trim off the end. Which means you'd have to change either the way the allocator works, or you'd have to add the devices as less than their full sizes, or something like that, so that you preserve the end of the device as unallocated, just in case you want to remove a disk.
B: All right, a question from Stuart: can you talk about how resilver load affects performance? Don't you find you need to throttle resilver?
A: Yeah, so that's related to an earlier question, and yes, there are definitely going to be trade-offs. This is the situation whenever you talk about resilver performance and its impact on ongoing workload: every customer, everybody, has a different feeling about it. Typically, some people say: I don't care, I want to restore my full redundancy as quickly as possible, and I don't care what impact that has on my ongoing workloads.
A: Others will say: no, I'm willing to take a risk and let it go a little longer, just don't have too much impact on my ongoing workloads; that's more critical to me. So there are some tunables in that space; you can tweak them and play with them. And, as I said earlier, I don't know that we have the complete answers, like: here are exactly the right tunings for this amount of impact versus that amount of impact. I think there is actually space in the project for some follow-on work; we may end up tweaking or adding some extra tunables to allow more fine-grained control. But until we have more experience with it, that's the best answer I can give.
B: All right, thank you. I think we have time for the last two questions; if there are any more, they can be done at the breakout sessions. So, one from Becky: do you recommend, or not, using dRAID with NVMe drives?
A: I absolutely recommend it. This is an example where we do use dRAID on NVMe at HPE, on our product, when we add in the direct I/O work, which is forthcoming as a new feature in ZFS: the actual combination of dRAID with direct I/O. We see some very good performance numbers off of NVMe, and it can benefit; I mean, it's not as critical in terms of rebuilds, because NVMe is so much faster, but it still works well, and it's a good choice.
B: All right, last question, from Jan: does the dRAID rebuild trigger a resilver?
A: Does the dRAID rebuild trigger a resilver... so, back to our sequential resilver discussion earlier: the dRAID rebuild is just a sequential resilver, and after a sequential resilver we always trigger a scrub, so that part happens as a scrub, not a resilver. But yes, I think the answer you're looking for is yes.
B: Okay, thank you. And it looks like Becky wants to squeeze in one follow-up, so this is going to be the last one for real: does it require a particular type of NVMe?
A: So, our experiments to date have been on some pretty high-end NVMe drives, 3.8-terabyte drives that are capable of transfer rates on the order of four gigabytes a second, and those work well for us. But I don't think it necessarily depends on a particular NVMe drive type. Obviously, you want to factor in drive-writes-per-day calculations, that kind of stuff, but dRAID isn't necessarily any different, in terms of its I/O use patterns, from any other ZFS configuration you might choose. So I don't think there's a particular NVMe type that would make more or less sense.